Elsevier

Signal Processing

Volume 75, Issue 2, June 1999, Pages 151-159

Improved noise suppression filter using self-adaptive estimator of probability of speech absence

https://doi.org/10.1016/S0165-1684(98)00230-8

Abstract

In this paper, two estimators of the probability of speech absence are derived using the common assumption that the Fourier coefficients of a frame of speech and noise samples are statistically independent Gaussian random variables (Ephraim and Malah, 1984; McAulay and Malpass, 1980). The estimators are obtained directly from the noisy speech itself. The first estimator is obtained by binary classification of the received spectral amplitude into a speech-present or speech-absent state. The second estimator is obtained by deriving the conditional probability of speech absence given the received spectral amplitude. Each of the time-adaptive estimators produces an estimate of the probability of speech absence for each spectral frequency. The estimated probability will be lower during speech periods and higher during silence periods. The estimated probability can be fed directly to any filter that requires such an estimate, e.g. the Ephraim and Malah noise suppressor (Ephraim and Malah, 1984) and the modified power subtraction method (Scalart and Vieira Filho, 1996), with significant improvements for various noise types.

Introduction

There are two forms of speech absence which can be used to improve the noise removal filter. The first arises when the speaker pauses, which leaves significant portions of silence. The second arises while the speaker is talking: the speech energy is not present in all the frequency components, and speech can be considered absent in those components with insignificant energy.

Such knowledge of speech absence can be used to improve a speech enhancement filter. The first attempt at utilizing the uncertainty of speech absence was made by McAulay and Malpass [2], who derived a filter based on a fixed probability of speech absence of 0.5. The Ephraim and Malah noise removal filter [1] adopted a more flexible approach in which each spectral frequency component can be assigned its own probability of speech absence, ranging from zero to one. However, that paper did not address how the probability of speech absence can be estimated, and for performance evaluation the probability of speech absence was set empirically to 0.2. Intuitively, we expect the probability of speech absence to be a function of time and frequency.

In this paper, we formulate two approaches which adaptively estimate the probabilities of speech absence in different frequency components from the noisy speech itself. In the first approach, each noisy spectral component is hard (binary) classified as speech presence or speech absence. In the second approach, each noisy spectral component is soft (statistically) classified, i.e. it is assigned a probability of speech absence such as 0.6. After the classification stage, the probability of speech absence is computed using a running average of the classification results.
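The running average of the classification results can be illustrated with a first-order recursive average. The sketch below is illustrative only: the function name `update_absence_probability`, the starting value 0.5 and the smoothing constant `alpha = 0.9` are assumptions, and the exact form of averaging used in this paper is not shown in this excerpt.

```python
def update_absence_probability(q_prev, decisions, alpha=0.9):
    """Perform one running-average step for every frequency bin.

    q_prev    : previous estimates of the probability of speech absence
    decisions : classification results for the current frame; 1.0 means
                "speech absent" and 0.0 means "speech present" (hard),
                or any value in [0, 1] for a soft classification
    alpha     : hypothetical smoothing constant (an assumption, not a
                value taken from the paper)
    """
    if len(q_prev) != len(decisions):
        raise ValueError("q_prev and decisions must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * q + (1.0 - alpha) * d for q, d in zip(q_prev, decisions)]


# Three frequency bins, all starting from an uninformative value of 0.5.
q = [0.5, 0.5, 0.5]
frames = [
    [1.0, 0.0, 0.6],  # bin 0 judged absent, bin 1 present, bin 2 soft 0.6
    [1.0, 0.0, 0.6],
    [1.0, 0.0, 0.6],
]
for frame in frames:
    q = update_absence_probability(q, frame)
```

With this kind of update, a bin that is repeatedly classified as speech absent drifts towards one and a bin that is repeatedly classified as speech present drifts towards zero, so the estimate follows both time and frequency.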

Both methods provide an estimate of speech absence that varies with time and frequency rather than a constant value. During periods where speech is absent, the probability of speech absence will be close to one, while during voiced speech it will be close to zero at the pitch frequency component.

The probabilities obtained are then fed into a slightly modified form of the Ephraim and Malah filter which takes into account the uncertainty of signal presence. The results show both an improvement in speech quality and better segmental SNR values. The technique can likewise be applied to other filters which require the probability of speech absence as an input, e.g. the modified power subtraction method using the a priori SNR proposed in [3], where the results also show significant improvements.

Section snippets

Ephraim and Malah noise suppression filter

This section provides a brief description of the Ephraim and Malah noise suppression filter [1], which gives excellent results. Let the kth spectral magnitudes of the speech signal, noise and noisy speech be denoted by $A_k$, $D_k$ and $R_k$, respectively. The probability of speech absence is denoted by $q_k$. The kth spectral output, $\hat{A}_k$, of the Ephraim and Malah noise suppression filter, taking into account the uncertainty of signal presence, is given by the equation

$$\hat{A}_k = G(q_k)\,\frac{\sqrt{\pi v_k}}{2\gamma_k}\,M(-0.5;1;-v_k)\,R_k,$$

where $v_k = \xi_k\,\ldots$

Modified power subtraction filter

This modified power subtraction filter is a speech enhancement filter proposed by Scalart and Vieira Filho [3] using the power subtraction technique together with the a priori SNR estimated by the decision-directed approach [1] in Eq. (8). The additional attenuation for silence periods can also be incorporated for better performance. Using the same notation as in the above section, the combined filter can be described as follows:

$$\hat{A}_k = G(q_k)\,\sqrt{\frac{\hat{\xi}_k}{\hat{\xi}_k+1}}\,R_k.$$
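A minimal Python sketch of the combined filter for a single frequency bin follows. It assumes the square-root form of the power subtraction gain from [3]. The silence attenuation G(q_k) is passed in as a plain number `g_q`, because its definition is not given in this excerpt. The function and parameter names are hypothetical.

```python
import math


def power_subtraction_output(r_k, xi_hat, g_q=1.0):
    """Return the enhanced spectral amplitude for one frequency bin.

    r_k    : noisy spectral amplitude R_k
    xi_hat : estimated a priori SNR (linear scale, not dB)
    g_q    : extra attenuation G(q_k); supplied by the caller because its
             definition is not shown in this excerpt (assumption)
    """
    if r_k < 0.0 or xi_hat < 0.0:
        raise ValueError("r_k and xi_hat must be non-negative")
    gain = math.sqrt(xi_hat / (xi_hat + 1.0))
    return g_q * gain * r_k


# The gain rises towards 1 as the a priori SNR grows.
for xi in (0.1, 1.0, 10.0, 100.0):
    print(xi, power_subtraction_output(1.0, xi))
```

At a low a priori SNR the bin is strongly attenuated, while at a high a priori SNR the noisy amplitude passes almost unchanged; `g_q` then scales the result further in bins where speech is likely to be absent.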

Hard decision estimator

The first method makes a hard decision: it classifies a received noisy amplitude as one which contains speech or one which contains noise alone. The decision is binary, with 0 representing speech presence and 1 representing speech absence. Let the input be represented by two states, $H_0$ and $H_1$, where $H_0$: speech absence and $H_1$: speech presence. Using the Gaussian statistical model in [1,2], the conditional probability density function of receiving the noisy amplitude $R_k$, given that speech is absent, is

$$P(R_k \mid H_0) = \frac{2R_k}{\lambda_d(k)}\exp\left(-\frac{R_k^2}{\lambda_d(k)}\right).$$
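One possible way to turn these densities into a binary decision is sketched below. The decision rule used in this paper is not shown in this excerpt, so the sketch assumes a plain maximum-likelihood comparison. It also assumes that, under the same Gaussian model, the amplitude given speech presence is Rayleigh distributed with variance $\lambda_d(1+\xi_k)$, where $\lambda_d$ is the noise variance and $\xi_k$ the a priori SNR. All function names are hypothetical.

```python
import math


def rayleigh_pdf(r, variance):
    """Density of the magnitude of a zero-mean complex Gaussian
    variable whose total variance is `variance`."""
    return (2.0 * r / variance) * math.exp(-r * r / variance)


def hard_decision(r_k, lambda_d, xi_k):
    """Return 1 for speech absence (H0) and 0 for speech presence (H1).

    Assumed rule: choose the state with the larger likelihood.
    r_k      : noisy spectral amplitude
    lambda_d : noise variance in this bin (must be positive)
    xi_k     : a priori SNR in this bin (linear scale, non-negative)
    """
    if lambda_d <= 0.0 or xi_k < 0.0 or r_k < 0.0:
        raise ValueError("need lambda_d > 0, xi_k >= 0 and r_k >= 0")
    p_given_h0 = rayleigh_pdf(r_k, lambda_d)
    p_given_h1 = rayleigh_pdf(r_k, lambda_d * (1.0 + xi_k))
    return 1 if p_given_h0 >= p_given_h1 else 0


# A small amplitude is labelled noise only, a large one as speech.
print(hard_decision(0.5, lambda_d=1.0, xi_k=4.0))
print(hard_decision(3.0, lambda_d=1.0, xi_k=4.0))
```

The resulting sequence of 0/1 labels per frequency bin is what a running average would then smooth into a probability of speech absence.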

Soft decision estimator

Unlike the hard decision method, which classifies the received amplitude into either speech presence or speech absence in a binary fashion, the soft decision method produces a value between 0 and 1 that represents the probability that the received amplitude comes from a speech absence state. Using the conditional probabilities, the probability that speech is absent can be obtained from Bayes' theorem:

$$P(H_0 \mid R_k) = \frac{P(R_k \mid H_0)\,P(H_0)}{P(R_k \mid H_0)\,P(H_0) + P(R_k \mid H_1)\,P(H_1)}.$$

However, values of P(H0) and P(H1) are
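The Bayes expression above can be evaluated numerically as in the following sketch. Several choices are assumptions made for illustration: both likelihoods are taken to be Rayleigh densities under the Gaussian model, with variance $\lambda_d$ when speech is absent and $\lambda_d(1+\xi_k)$ when speech is present, and the prior P(H0) defaults to 0.5 because this excerpt does not show how the priors are handled. The function name is hypothetical. The likelihood ratio is computed in the log domain to avoid overflow.

```python
import math


def soft_decision(r_k, lambda_d, xi_k, p_h0=0.5):
    """Return P(H0 | R_k), the posterior probability of speech absence.

    r_k      : noisy spectral amplitude
    lambda_d : noise variance in this bin (must be positive)
    xi_k     : a priori SNR in this bin (linear scale, non-negative)
    p_h0     : prior probability of speech absence (0.5 is an assumption)
    """
    if lambda_d <= 0.0 or xi_k < 0.0 or r_k < 0.0:
        raise ValueError("need lambda_d > 0, xi_k >= 0 and r_k >= 0")
    if not 0.0 < p_h0 < 1.0:
        raise ValueError("p_h0 must lie strictly between 0 and 1")
    # log of P(R_k | H1) / P(R_k | H0) for the two Rayleigh densities
    log_ratio = (r_k * r_k / lambda_d) * (xi_k / (1.0 + xi_k)) - math.log(1.0 + xi_k)
    ratio = math.exp(min(log_ratio, 700.0))  # clamp to avoid OverflowError
    return p_h0 / (p_h0 + (1.0 - p_h0) * ratio)


# Small amplitudes give a posterior near one, large amplitudes near zero.
for r in (0.5, 1.5, 3.0):
    print(r, soft_decision(r, lambda_d=1.0, xi_k=4.0))
```

Unlike the hard decision, the output varies smoothly with the received amplitude, so amplitudes near the decision boundary contribute intermediate values to the running average.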

Results and discussions

A total of 10 different utterances, taken from the TIMIT database, are used in our evaluation. Half of the utterances are from male speakers and the rest are from female speakers. The speech data are sampled at 8 kHz and quantized linearly using 16 bits. Three different types of additive noise are used, namely Gaussian white noise, recorded fan noise and the F16 (fighter jet) noise from the NOISEX database.

The noisy speech data were divided into frames, each of

Conclusions

This paper proposes two methods of estimating the probability of speech absence adaptively from the noisy speech itself. It shows that by using these two estimators of speech absence, the performance of the Ephraim and Malah noise filter and the power subtraction filter can be significantly improved. The technique should also be applicable to other filters that incorporate the probability of speech absence.

References (6)

  • [1] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,...
  • [2] R.J. McAulay, M.L. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Trans. Acoust....
  • [3] P. Scalart, J. Vieira Filho, Speech enhancement based on a priori signal to noise estimation, in: Proc. ICASSP, Vol....
There are more references available in the full text version of this article.
