Signal Processing

Volume 87, Issue 9, September 2007, Pages 2026-2035

Variational Bayesian learning for speech modeling and enhancement

https://doi.org/10.1016/j.sigpro.2007.01.035

Abstract

A new variational Bayesian (VB) learning approach for speech modeling and enhancement is proposed in this paper. We model the clean speech signal as a time-varying autoregressive (TVAR) process and use VB learning to estimate the model parameters and the clean signal in an integrated manner. The proposed algorithm efficiently exploits prior information and the statistical structure of the speech model and the noise characteristics. Furthermore, it automatically selects the model order and avoids overfitting in the estimation. Experimental comparisons with other methods demonstrate the performance of our algorithm.

Introduction

In practice, clean signals are rarely available: speech is often contaminated by background or application-specific noise processes. Speech enhancement therefore plays an important role in speech recognition, speech analysis, and related tasks. Many enhancement methods exist, including spectral subtraction (SS) [1] and thresholding shrinkage [3], [4], which removes additive noise by shrinking the coefficients whose small absolute values mark them as noise. The major problem with SS is the annoying nonstationary musical background it leaves in the enhanced speech. Thresholding methods assume prior knowledge of the variance of the additive Gaussian noise; in many practical problems, however, this variance is unknown and must be estimated from the coefficients deemed to be noise. An alternative is model-based speech enhancement, which applies an autoregressive (AR) process to model the speech production system [5], [6]. Speech is nonstationary over long durations, so a time-invariant AR model processes the signal in short segments within which it is assumed stationary; yet nonstationary frames persist even for very short segments. A time-varying autoregressive (TVAR) process, whose AR coefficients are allowed to vary with time, provides a more appropriate and accurate representation: it reflects the nonstationarity of speech and can analyse longer stretches of signal. Compared with several other methods, enhancement based on the TVAR model improves both the quality and the signal-to-noise ratio (SNR) of speech. How to model and estimate the clean speech from the noisy signal, however, remains a challenge.
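To make the spectral-subtraction baseline concrete, the following is a minimal magnitude-domain sketch; the frame length, the spectral-floor factor, and the use of a noise-only segment for the noise estimate are illustrative choices, not details taken from the paper:

```python
import numpy as np

def spectral_subtract(noisy, noise_est, frame=256, floor=0.01):
    """Basic magnitude spectral subtraction (illustrative sketch).

    noisy     : 1-D noisy signal
    noise_est : 1-D noise-only segment used to estimate the noise spectrum
    """
    # Average noise magnitude spectrum from the noise-only segment
    n_frames = len(noise_est) // frame
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise_est[i * frame:(i + 1) * frame]))
         for i in range(n_frames)], axis=0)

    out = np.zeros_like(noisy, dtype=float)
    for i in range(len(noisy) // frame):
        seg = noisy[i * frame:(i + 1) * frame]
        spec = np.fft.rfft(seg)
        mag = np.abs(spec) - noise_mag             # subtract noise estimate
        mag = np.maximum(mag, floor * noise_mag)   # spectral floor
        # Resynthesize with the noisy phase
        out[i * frame:(i + 1) * frame] = np.fft.irfft(
            mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

The spectral floor limits, but does not eliminate, the musical-noise artifact mentioned above: isolated spectral peaks that survive subtraction in individual frames become audible tones.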

In this paper, we model speech as a TVAR process with stochastically evolving parameters, observed in additive white Gaussian noise [7], [8]. Our objective is to estimate the clean speech and the TVAR parameters directly from the noisy signal, with the enhancement problem and estimation objective cast in a Bayesian framework. Unfortunately, maximum a posteriori inference is analytically intractable, requiring numerical methods for its solution [7]. Markov chain Monte Carlo methods aim at exact results but typically require vast computational resources and become impractical for a complex model in high data dimensions. Moreover, neither approach can determine the model order automatically. The variational Bayesian (VB) approximation, derived from the mean-field theorem in statistical mechanics, is a practical framework for Bayesian computation: it permits analytical calculation of posterior distributions over hidden variables, parameters and structures [9]. We therefore use VB learning to solve the enhancement problem and estimate the TVAR parameters and the clean speech signal in an integrated manner. Our algorithm avoids overfitting in the estimation process and performs model order selection automatically; overfitting would otherwise degrade speech quality and may introduce musical noise into the enhanced speech.

The remainder of the paper is organized as follows. Section 2 presents probabilistic formulations for speech modeling and enhancement. Section 3 estimates the model parameters and the clean speech by VB approximation. Section 4 reports experiments that demonstrate the performance of our algorithm, and Section 5 concludes the paper.

Section snippets

Probabilistic formulations for speech modeling and enhancement

The definitions of all probability distributions used in this section are given in Appendix A. A speech signal s_t can be modeled by a pth-order TVAR process:

s_t = w_t^T s_t^(p) + e_t,   e_t ~ G(0, β),

where the subscript t is a time index with t ∈ {1, 2, …, T}; w_t = [w_t(1), w_t(2), …, w_t(p)]^T denotes the time-varying coefficient vector of the model, and the full coefficient set is W = {w_1, w_2, …, w_T}; s_t^(p) = [s_{t-1}, s_{t-2}, …, s_{t-p}]^T collects the previous values on which s_t depends; e_t
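The TVAR generative model above can be illustrated with a small simulator. The random-walk drift on the coefficients and the root-shrinking stabilization step are assumptions made here for illustration; the paper's exact stochastic evolution of the parameters (involving F, R, etc.) is not reproduced:

```python
import numpy as np

def simulate_tvar(T=500, p=2, coef_drift=0.005, noise_std=0.1, seed=0):
    """Simulate a pth-order TVAR process s_t = w_t^T [s_{t-1},...,s_{t-p}] + e_t.

    Coefficients follow a small random walk and are rescaled whenever the
    AR polynomial's roots leave the unit circle (illustrative choices only).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(p)
    w[:min(p, 2)] = [1.5, -0.9][:min(p, 2)]       # start near a resonant AR(2)
    s = np.zeros(T + p)                           # p leading zeros = initial state
    W = np.zeros((T, p))
    for t in range(T):
        w = w + coef_drift * rng.standard_normal(p)    # random-walk drift
        roots = np.roots(np.concatenate(([1.0], -w)))  # z^p - w1 z^{p-1} - ... - wp
        r = np.max(np.abs(roots))
        if r >= 1.0:                              # shrink all roots inside |z| < 1
            w = w * (0.98 / r) ** np.arange(1, p + 1)
        W[t] = w
        past = s[t:t + p][::-1]                   # [s_{t-1}, ..., s_{t-p}]
        s[t + p] = w @ past + noise_std * rng.standard_normal()
    return s[p:], W
```

Because the coefficients drift slowly, the simulated signal's local spectrum changes over time, mimicking the nonstationarity that motivates the TVAR model over a fixed AR model.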

VB learning for TVAR parameters and clean speech

With the probabilistic formulation of the noisy speech signal in place, the key to speech modeling and enhancement is learning these distributions from the noisy data. The VB method was recently developed to approximate posterior densities [9], [12]. In terms of the graphical model in Fig. 1, denoting the clean speech by S={s_1, s_2, …, s_T} corresponding to the observations X={x_1, x_2, …, x_T}, the parameters by Θ={W, F, R, β, γ, δ}, and the approximate posterior pdf by Q(S, Θ), the marginal likelihood of a model p(X|H) can
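The mean-field idea behind this scheme, factorizing Q(S, Θ) and updating each factor with the others held fixed, can be illustrated on a textbook conjugate model (a Gaussian with unknown mean and precision). This is a deliberately simple stand-in, far smaller than the paper's TVAR model, but it has the same alternating coordinate-ascent structure:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Coordinate-ascent VB for x_i ~ N(mu, 1/tau) with conjugate priors.

    q(mu, tau) is factorized as q(mu) q(tau); each factor is updated
    in turn using the current expectation under the other factor.
    """
    x = np.asarray(x, dtype=float)
    N, xbar, xsq = len(x), np.mean(x), np.sum(x**2)
    E_tau = a0 / b0                       # initial expected precision
    for _ in range(n_iter):
        # Update q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed
        a_N = a0 + (N + 1) / 2
        E_sq = xsq - 2 * mu_N * N * xbar + N * (mu_N**2 + 1 / lam_N)
        b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0)**2 + 1 / lam_N))
        E_tau = a_N / b_N
    return mu_N, a_N / b_N                # posterior mean, expected precision
```

Each update provably increases a lower bound on the marginal likelihood p(X|H), which is also what allows VB to compare model structures (and hence select the TVAR model order) automatically.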

Experiments

In this section, experimental results with real-world speech signals are presented to verify the performance of the proposed algorithm. To judge noise-reduction performance objectively, the SNR was used [6]. To assess the overall quality of the enhanced signal, a subjective test based on the MUSHRA standard was carried out [16]. Fifteen listeners took part in the test, seated at a centred position 1.5 m from the loudspeaker. The listeners were presented with all
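The objective measure can be sketched as a global SNR between the clean reference and the processed signal; this is one common definition, and the paper's exact variant (e.g. a segmental SNR) may differ:

```python
import numpy as np

def snr_db(clean, processed):
    """Global SNR in dB: clean-signal energy over residual-error energy."""
    clean = np.asarray(clean, dtype=float)
    err = clean - np.asarray(processed, dtype=float)
    return 10.0 * np.log10(np.sum(clean**2) / np.sum(err**2))
```

For example, adding a constant offset of 0.1 to a unit-amplitude sinusoid yields an SNR of 10·log10(50) ≈ 17 dB, since the sinusoid's mean power is 0.5 and the error power is 0.01.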

Conclusion

In this paper, a TVAR model with stochastically evolving parameters is applied to speech modeling and enhancement, and VB learning is used to estimate the clean signal and the model parameters. The clean speech is obtained by a VKS. The advantage of the proposed algorithm is that it fully exploits the priors and the statistical structure of the whole model; moreover, it automatically chooses the model order and avoids overfitting in the estimation. Our algorithm has acquired better enhancement results

References (17)

  • R. Martin, Spectral subtraction based on minimum statistics, in: Proceedings of the EUSIPCO, Edinburgh, September 1994, ...
  • R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Trans. Speech Audio Process. (2001)
  • H. Chipman et al., Adaptive Bayesian wavelet shrinkage, J. Amer. Stat. Assoc. (1997)
  • Q. Huang et al., Bayesian adaptive shrinkage in independent component domain, Electron. Papers (2006)
  • S.J. Godsill et al., Digital Audio Restoration—A Statistical Model-based Approach (1998)
  • S. Gannot et al., Iterative and sequential Kalman filter-based speech enhancement algorithms, IEEE Trans. Speech Audio Process. (1998)
  • J. Vermaak, C. Andrieu, A. Doucet, S.J. Godsill, Non-stationary Bayesian modeling and enhancement of speech signals, ...
  • M.J. Cassidy et al., Bayesian nonstationary autoregressive models for biomedical signal analysis, IEEE Trans. Biomed. Eng. (2002)
