Use of speech presence uncertainty with MMSE spectral energy estimation for robust automatic speech recognition

doi:10.1016/j.specom.2010.08.001

Speech Communication

Volume 53, Issue 1, January 2011, Pages 51-61

https://doi.org/10.1016/j.specom.2010.08.001

Abstract

In this paper, we investigate the minimum mean square error (MMSE) spectral energy estimator for use in environment-robust automatic speech recognition (ASR). In the past, it has been common to use the MMSE log-spectral amplitude estimator for this task. However, this estimator was originally derived under subjective human listening criteria. Therefore, its complex suppression rule may not be optimal for use in ASR. On the other hand, it can be shown that the MMSE spectral energy estimator is closely related to the MMSE Mel-frequency cepstral coefficient (MFCC) estimator. Despite this, the spectral energy estimator has tended to suffer from the problem of excessive residual noise. We examine the cause of this residual noise and show that the introduction of a heuristic-based speech presence uncertainty (SPU) can significantly improve its performance as a front-end ASR enhancement regime. The proposed spectral energy SPU estimator is evaluated on the Aurora2, RM and OLLO2 speech recognition tasks and is shown to significantly improve additive noise robustness over the more common spectral amplitude and log-spectral amplitude estimators.

Research highlights

► The spectral energy estimator is investigated for the purpose of robust automatic speech recognition. ► A speech presence uncertainty modification is proposed for the spectral energy estimator. ► When combined with speech presence uncertainty, the spectral energy estimator is shown to outperform the log-spectral amplitude estimator for recognition robustness.

Introduction

The development of robust automatic speech recognition (ASR) is an important goal. While the performance of state-of-the-art ASR is impressive under ideal conditions, its recognition accuracy tends to degrade rapidly in the presence of additive background noise. Since it is often impossible to eliminate all noise from the operating environment, the problem of ASR robustness has been receiving considerable attention. Several approaches have been proposed in the literature, most of which fall under two categories: front-end speech/feature enhancement and back-end model adaptation. Back-end adaptation seeks to modify the acoustic models of the recognizer to better match the noisy operating environment. Front-end enhancement, on the other hand, seeks to remove the effects of noise prior to recognition, either from the speech signal or from the parameterized features directly.

In this paper, we are interested in methods that perform enhancement on the speech signal. Several methods falling into this category have been reported in the literature (Loizou, 2007). These include spectral subtraction (Berouti et al., 1979), minimum mean square error (MMSE) estimation (Ephraim and Malah, 1985), Wiener filtering (linear MMSE) (Wiener, 1949), Kalman filtering (Paliwal and Basu, 1987) and subspace (Ephraim and Trees, 1995) methods. These algorithms are specifically designed to improve the subjective quality of an acoustic signal for human listeners. For example, the MMSE log-spectral amplitude (LSA) estimator is often favored because of its psychoacoustic considerations. While many of the aforementioned algorithms have been used for robust ASR (Lathoud et al., 2005, Ephraim and Trees, 1991, Gemello et al., 2006, Hermus et al., 2007, Fujimoto and Ariki, 2000), there are clear differences between the objectives of robust ASR and speech enhancement.

For subjective human listening, it is often held that noise suppression is most effective when applied to the log-spectral domain. The LSA estimator was derived under such an assumption. However, the typical ASR system does not operate directly on the log-spectral domain. Instead, higher level features such as Mel-frequency cepstral coefficients (MFCCs) are used. As a result, the complicated suppression rule of the LSA estimator may not be fully justified for use in ASR-based speech enhancement.

In this paper, we examine a similar estimator for use in robust ASR, namely the spectral energy (SE) estimator. Specifically, we investigate its suitability for estimating clean speech MFCCs from speech corrupted with additive noise. We show that the suppression rule of the SE estimator is closely related to the MMSE MFCC estimator, that is, an estimator that produces a cepstral estimate $\hat{c}_x$ that minimizes the square error from the true, clean MFCC vector $c_x$. Despite this, the SE estimator has several shortcomings that must be addressed before it can be used for robust ASR. Foremost among these problems is its tendency to under-suppress noise at low signal-to-noise ratios (SNRs). We identify two causes of this under-suppression: (1) an inherent positive bias when using the SE estimator to derive log-filterbank energies and (2) the tendency of the SE estimator to over-estimate the a priori SNR within a decision-directed framework (Ephraim and Malah, 1984). Later, we show that both of these issues may be corrected with the use of a heuristic-based speech presence uncertainty (SPU). The proposed SE SPU estimator offers a number of advantages over the LSA estimator. First, its suppression rule is more efficiently implemented and second, it offers better recognition performance across a wide range of noise types and SNRs.

The rest of this paper is organized as follows. In Section 2, we cover the statistical framework used to derive the common short-time spectral amplitude estimators. In Section 3, we investigate the use of the SE estimator for deriving MFCC features. Firstly, we examine the optimality of the SE estimator in the context of MFCC estimation. Secondly, we highlight the considerations that must be taken into account for practical implementation of the SE estimator. In Section 4, we first describe the use of SPU within the spectral estimation framework. We then show how SPU may be used to overcome the limitations of the SE estimator. In Section 5, we present experimental ASR results for the RM (Price et al., 1988), OLLO2 (Wesker et al., 2005) and Aurora2 (Pearce and Hirsch, 2000) ASR tasks. Lastly in Section 6, we present concluding remarks.

Section snippets

Statistical framework for short-time spectral amplitude estimation

The discrete short-time Fourier transform (DSTFT) of the corrupted speech signal $y(n)$ is given by $Y(m, k) = \sum_{n=-\infty}^{\infty} y(n)\, w(mS - n) \exp(-j 2 \pi k n / K),$ where $k$ denotes the $k$th discrete frequency of $K$ uniformly spaced frequencies, $w(n)$ is an analysis window function, $m$ is the short-time frame index and $S$ is the analysis frame shift (in samples). In this paper, we consider an additive noise model. Here, the corrupted speech DSTFT may also be represented as

Use of the SE estimator for ASR

As stated earlier, our goal in this paper is to investigate the SE estimator for use in ASR-based speech enhancement. One immediate justification for this is the simple gain rule (10), which requires less computation than both the SA and LSA estimators. A second reason for investigating the spectral energy estimator is that it is closely related to the log-filterbank energy estimator, the intermediate stage of the popular MFCC feature set (Huang et al., 2001).
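The gain rule (10) is not reproduced in this snippet. As an illustration only, the sketch below evaluates the closed form commonly given in the literature for the MMSE spectral energy estimate under a Gaussian model, written as an amplitude gain in terms of the a priori SNR ξ and the a posteriori SNR γ, with the Wiener gain alongside for comparison. The function names are hypothetical, and the assumption that this form coincides with (10) is ours.

```python
import math

def se_gain(xi, gamma):
    """Amplitude gain of the MMSE spectral energy estimator (assumed form).

    Assumes E[A^2 | Y] = w * (1/gamma + w) * |Y|^2 with w = xi / (1 + xi),
    so the gain applied to |Y| is the square root of that factor.
    """
    w = xi / (1.0 + xi)
    return math.sqrt(w * (1.0 / gamma + w))

def wiener_gain(xi):
    """Wiener (linear MMSE) amplitude gain, shown for comparison."""
    return xi / (1.0 + xi)

if __name__ == "__main__":
    # Compare the two gains at a low and a moderate a priori SNR.
    for xi, gamma in [(0.01, 1.0), (1.0, 2.0)]:
        print(xi, gamma, se_gain(xi, gamma), wiener_gain(xi))
```

The assumed rule needs only a few arithmetic operations and one square root per frequency bin. Because the factor under the square root is never smaller than the squared Wiener gain, this form suppresses no more than the Wiener filter for the same ξ.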

Despite these reasons, use of the

Use of speech presence uncertainty to improve the spectral energy estimator

In the previous section, we have highlighted two causes of noise under-suppression in the SE estimator:

1. The inherent positive bias when using the SE estimator to derive log-filterbank energies.
2. The tendency to over-estimate the a priori SNR ξ within the decision-directed framework.

Combined, these problems degrade ASR performance substantially in low SNR environments. To address both of these problems, we investigate the use of speech presence uncertainty (SPU) (McAulay and Malpass, 1980).
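For context, the decision-directed framework (Ephraim and Malah, 1984) estimates ξ as a weighted sum of two terms: the previous frame's enhanced spectral energy normalized by the noise variance, and the current maximum-likelihood estimate max(γ − 1, 0), where γ is the a posteriori SNR. A minimal per-bin sketch follows. The function name, the argument names and the smoothing value α = 0.98 are assumptions rather than settings taken from this paper, and the SPU modification itself is not shown.

```python
def decision_directed_xi(prev_clean_energy, noise_var, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate for one frequency bin.

    prev_clean_energy: enhanced spectral energy from the previous frame
    noise_var:         noise variance estimate for this bin
    gamma:             a posteriori SNR of the current frame
    alpha:             smoothing weight (0.98 is an assumed value)
    """
    ml_term = max(gamma - 1.0, 0.0)  # maximum-likelihood estimate, floored at zero
    return alpha * prev_clean_energy / noise_var + (1.0 - alpha) * ml_term

if __name__ == "__main__":
    # Noise-only frame (gamma = 1) after a frame that left residual energy behind.
    print(decision_directed_xi(prev_clean_energy=0.5, noise_var=1.0, gamma=1.0))
```

Since the first term feeds the previous frame's estimate back into ξ, energy left behind by the gain rule in one frame raises ξ in the next, which is one way to read the second cause listed above.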

Enhancement system description

For our experiments, we decompose speech utterances into overlapping frames. Each analysis frame is 25 ms in length, and overlaps the previous analysis frame by 15 ms. Each analysis frame has a Hamming window applied before being enhanced with a given regime. Enhanced frames are then synthesized into a coherent utterance with the overlap-add method (Crochiere, 1980). To derive the noise estimate λ_d(m, k), we use a simple voice activity detector (VAD). An initial noise estimate is generated from the
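The framing and synthesis steps described in this section can be sketched as follows, using NumPy. This is a minimal illustration, not the authors' implementation: the 8 kHz sampling rate, the function name `enhance_ola`, the `gain_fn` callback and the normalization by the summed window are all assumptions. The 25 ms frame length and the 10 ms shift (implied by the 15 ms overlap) come from the text.

```python
import numpy as np

def enhance_ola(y, fs=8000, frame_ms=25, shift_ms=10, gain_fn=None):
    """Frame, window, modify and overlap-add a signal (illustrative sketch).

    gain_fn, if given, maps a frame's complex spectrum to real per-bin gains.
    """
    frame_len = int(fs * frame_ms / 1000)  # 200 samples at the assumed 8 kHz
    shift = int(fs * shift_ms / 1000)      # 80 samples, i.e. 15 ms of overlap
    window = np.hamming(frame_len)
    out = np.zeros(len(y))
    norm = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, shift):
        frame = y[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        if gain_fn is not None:
            spec = gain_fn(spec) * spec    # apply the chosen suppression rule
        out[start:start + frame_len] += np.fft.irfft(spec, n=frame_len)
        norm[start:start + frame_len] += window
    covered = norm > 1e-8                  # samples reached by at least one frame
    out[covered] /= norm[covered]          # undo the summed analysis window
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal(8000)
    restored = enhance_ola(noisy)          # no gain applied: covered samples pass through
    print(np.allclose(restored[:7960], noisy[:7960]))
```

With no gain applied, the covered part of the signal is returned unchanged, which gives a quick check that the analysis and synthesis stages are consistent before a suppression rule is plugged in.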

Conclusion

In this paper, we have investigated the spectral energy estimator for use in robust ASR. Traditionally, the spectral energy estimator has suffered from the problem of residual noise. In order to improve the SE estimator for use in robust ASR, we identified the causes of the residual noise. These problems were then addressed with a simple, heuristic-based SPU. Experimental results show a significant improvement in robustness, over both the baseline results and the more common

References (23)

Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: IEEE Internat....
Boll, S., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
Cappe, O., 1994. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process.
Crochiere, R., 1980. A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process.
Ephraim, Y., et al., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
Ephraim, Y., et al., 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
Ephraim, Y., et al., 1991. Constrained iterative speech enhancement with application to speech recognition. IEEE Trans. Signal Process.
Ephraim, Y., et al., 1995. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process.
Fujimoto, M., Ariki, Y., 2000. Noisy speech recognition using noise reduction method based on Kalman filter. In: IEEE...
Gales, L.F.J., 1995. Model-Based Techniques For Robust Speech Recognition. Ph.D. Thesis, University of Cambridge,...
Gemello, R., et al., 2006. Automatic speech recognition with a modified Ephraim–Malah rule. IEEE Signal Process. Lett.

