Use of speech presence uncertainty with MMSE spectral energy estimation for robust automatic speech recognition
Research highlights
► The spectral energy estimator is investigated for the purpose of robust automatic speech recognition.
► A speech presence uncertainty modification is proposed for the spectral energy estimator.
► When combined with speech presence uncertainty, the spectral energy estimator is shown to outperform the log-spectral amplitude estimator for recognition robustness.
Introduction
The development of robust automatic speech recognition (ASR) is an important goal. While the performance of state-of-the-art ASR is impressive under ideal conditions, its recognition accuracy tends to degrade rapidly in the presence of additive background noise. Since it is often impossible to eliminate all noise from the operating environment, the problem of ASR robustness has received considerable attention. Several approaches have been proposed in the literature, most of which fall into two categories: front-end speech/feature enhancement and back-end model adaptation. Back-end adaptation seeks to modify the acoustic models of the recognizer to better match the noisy operating environment. Front-end enhancement, on the other hand, seeks to remove the effects of noise prior to recognition, either from the speech signal or from the parameterized features directly.
In this paper, we are interested in methods that perform enhancement on the speech signal. Several methods falling into this category have been reported in the literature (Loizou, 2007). These include spectral subtraction (Berouti et al., 1979), minimum mean square error (MMSE) estimation (Ephraim and Malah, 1985), Wiener filtering (linear MMSE) (Wiener, 1949), Kalman filtering (Paliwal and Basu, 1987) and subspace (Ephraim and Trees, 1995) methods. These algorithms are specifically designed to improve the subjective quality of an acoustic signal for human listeners. For example, the MMSE log-spectral amplitude (LSA) estimator is often favored because of its psychoacoustic considerations. While many of the aforementioned algorithms have been used for robust ASR (Lathoud et al., 2005, Ephraim and Trees, 1991, Gemello et al., 2006, Hermus et al., 2007, Fujimoto and Ariki, 2000), there are clear differences between the objectives of robust ASR and speech enhancement.
For subjective human listening, it is often held that noise suppression is most effective when applied to the log-spectral domain. The LSA estimator was derived under such an assumption. However, the typical ASR system does not operate directly on the log-spectral domain. Instead, higher level features such as Mel-frequency cepstral coefficients (MFCCs) are used. As a result, the complicated suppression rule of the LSA estimator may not be fully justified for use in ASR-based speech enhancement.
In this paper, we examine a similar estimator for use in robust ASR, namely the spectral energy (SE) estimator. Specifically, we investigate its suitability for estimating clean speech MFCCs from speech corrupted with additive noise. We show that the suppression rule of the SE estimator is closely related to that of the MMSE MFCC estimator, that is, the estimator whose cepstral estimate minimizes the mean square error from the true, clean MFCC vector c_x. Despite this, the SE estimator has several shortcomings that must be addressed before it can be used for robust ASR. Foremost among these problems is its tendency to under-suppress noise at low signal-to-noise ratios (SNRs). We identify two causes of this under-suppression: (1) an inherent positive bias when using the SE estimator to derive log-filterbank energies and (2) the tendency of the SE estimator to over-estimate the a priori SNR within a decision-directed framework (Ephraim and Malah, 1984). Later, we show that both of these issues may be corrected with the use of a heuristic-based speech presence uncertainty (SPU). The proposed SE SPU estimator offers two advantages over the LSA estimator. First, its suppression rule is more efficiently implemented and second, it offers better recognition performance across a wide range of noise types and SNRs.
The rest of this paper is organized as follows. In Section 2, we cover the statistical framework used to derive the common short-time spectral amplitude estimators. In Section 3, we investigate the use of the SE estimator for deriving MFCC features. Firstly, we examine the optimality of the SE estimator in the context of MFCC estimation. Secondly, we highlight the considerations that must be taken into account for practical implementation of the SE estimator. In Section 4, we first describe the use of SPU within the spectral estimation framework. We then show how SPU may be used to overcome the limitations of the SE estimator. In Section 5, we present experimental ASR results for the RM (Price et al., 1988), OLLO2 (Wesker et al., 2005) and Aurora2 (Pearce and Hirsch, 2000) ASR tasks. Lastly, in Section 6, we present concluding remarks.
Section snippets
Statistical framework for short-time spectral amplitude estimation
The discrete short-time Fourier transform (DSTFT) of the corrupted speech signal y(n) is given by

$$Y(m,k) = \sum_{n=-\infty}^{\infty} y(n)\, w(mS - n)\, e^{-j 2\pi k n / K},$$

where k denotes the kth discrete frequency of K uniformly spaced frequencies, w(n) is an analysis window function, m is the short-time frame index and S is the analysis frame shift (in samples). In this paper, we consider an additive noise model. Here, the corrupted speech DSTFT may also be represented as
Use of the SE estimator for ASR
As stated earlier, our goal in this paper is to investigate the SE estimator for use in ASR-based speech enhancement. One immediate justification for this is the simple gain rule (10), which requires less computation than both the spectral amplitude (SA) and LSA estimators. A second reason for investigating the spectral energy estimator is that it is closely related to the log-filterbank energy estimator, which forms the intermediate stage of the popular MFCC feature set (Huang et al., 2001).
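To illustrate why the SE gain rule is inexpensive, the following minimal sketch evaluates the closed form commonly given for the MMSE spectral energy (power) estimator under complex Gaussian speech and noise models. It is assumed, not confirmed by this excerpt, that this form corresponds to gain rule (10). The function names and the symbols `xi` (a priori SNR) and `gamma` (a posteriori SNR, assumed strictly positive) are chosen here for illustration only.

```python
import math


def wiener_gain(xi):
    """Wiener gain xi / (1 + xi) for a priori SNR xi (linear scale)."""
    return xi / (1.0 + xi)


def se_gain(xi, gamma):
    """Amplitude-domain gain of the MMSE spectral energy estimator.

    Assumed closed form (complex Gaussian model):
        G_SE = sqrt( W * (1 / gamma + W) ),  with W = xi / (1 + xi).
    The enhanced energy is G_SE**2 * |Y|**2. Only one square root and a few
    multiplications are needed; no Bessel functions or exponential integrals.
    """
    w = wiener_gain(xi)
    return math.sqrt(w * (1.0 / gamma + w))


# Example: xi = 1 (0 dB a priori SNR), gamma = 1 (0 dB a posteriori SNR).
example_gain = se_gain(1.0, 1.0)  # sqrt(0.5 * 1.5) = sqrt(0.75)
```

Under this form the SE gain is never smaller than the Wiener gain for the same `xi`, which is consistent with the under-suppression at low SNR noted in the introduction.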
Despite these reasons, use of the
Use of speech presence uncertainty to improve the spectral energy estimator
In the previous section, we highlighted two causes of noise under-suppression in the SE estimator:
1. The inherent positive bias of the SE estimator when it is used to derive log-filterbank energies.
2. The tendency to over-estimate the a priori SNR ξ within the decision-directed framework.
Combined, these problems degrade ASR performance substantially in low SNR environments. To address both of these problems, we investigate the use of speech presence uncertainty (SPU) (McAulay and Malpass, 1980).
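The interaction between the decision-directed a priori SNR estimate and a speech presence weighting can be sketched as follows. This is an illustration only: it uses the classical likelihood-ratio form of speech presence probability under complex Gaussian models, not the heuristic SPU proposed in this paper, whose details are not given in this excerpt. The smoothing constant `alpha = 0.98`, the prior speech-absence probability `q = 0.5`, and all function names are assumptions made for the example.

```python
import math


def decision_directed_xi(prev_gain, prev_gamma, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate (Ephraim and Malah, 1984).

    prev_gain and prev_gamma come from the previous frame at the same
    frequency bin; gamma is the current a posteriori SNR.
    alpha is an assumed, typical smoothing constant.
    """
    ml_term = max(gamma - 1.0, 0.0)
    return alpha * prev_gain ** 2 * prev_gamma + (1.0 - alpha) * ml_term


def speech_presence_prob(xi, gamma, q=0.5):
    """P(speech present | Y) from a Gaussian likelihood ratio.

    q is the assumed prior probability of speech absence. The expression is
    written with exp(-v) so that large v does not overflow.
    """
    v = xi * gamma / (1.0 + xi)
    inverse_ratio = (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v)
    return 1.0 / (1.0 + inverse_ratio)


def se_gain(xi, gamma):
    """Assumed closed-form MMSE spectral energy gain (amplitude domain)."""
    w = xi / (1.0 + xi)
    return math.sqrt(w * (1.0 / gamma + w))


def se_spu_gain(xi, gamma, q=0.5):
    """SE gain weighted by speech presence probability.

    If the clean energy is zero when speech is absent, the energy estimate is
    p * E[X^2 | Y, speech present], so the amplitude gain scales by sqrt(p).
    """
    p = speech_presence_prob(xi, gamma, q)
    return math.sqrt(p) * se_gain(xi, gamma)
```

Because the presence probability falls toward zero when both `xi` and `gamma` are small, the weighted gain suppresses noise-only bins more strongly than the unweighted SE gain, which is the qualitative behavior the SPU modification is meant to provide.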
Enhancement system description
For our experiments, we decompose speech utterances into overlapping frames. Each analysis frame is 25 ms in length and overlaps the previous analysis frame by 15 ms, giving a frame shift of 10 ms. A Hamming window is applied to each analysis frame before it is enhanced with a given regime. Enhanced frames are then synthesized into a coherent utterance with the overlap-add method (Crochiere, 1980). To derive the noise estimate λd(m, k), we use a simple voice activity detector (VAD). An initial noise estimate is generated from the
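The framing and resynthesis steps described in this snippet can be sketched in a few lines. The sketch assumes an 8 kHz sampling rate and a simple window-sum normalization at synthesis; neither is stated in this excerpt, and the weighted overlap-add procedure actually used may differ. The enhancement step itself is omitted, so the frames pass through unchanged.

```python
import math


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(signal, frame_len, shift):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append([signal[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


def overlap_add(frames, frame_len, shift):
    """Overlap-add the frames and divide by the summed window.

    The normalization is an assumption of this sketch; it undoes the
    analysis window wherever the window sum is non-zero.
    """
    window = hamming(frame_len)
    total = (len(frames) - 1) * shift + frame_len
    out = [0.0] * total
    norm = [0.0] * total
    for i, frame in enumerate(frames):
        start = i * shift
        for n in range(frame_len):
            out[start + n] += frame[n]
            norm[start + n] += window[n]
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]


fs = 8000                                # assumed sampling rate in Hz
frame_len = fs * 25 // 1000              # 25 ms frame
shift = frame_len - fs * 15 // 1000      # 15 ms overlap, so a 10 ms shift
signal = [math.sin(2.0 * math.pi * 440.0 * n / fs) for n in range(800)]

frames = split_frames(signal, frame_len, shift)
# An enhancement regime would modify each frame's spectrum here.
rebuilt = overlap_add(frames, frame_len, shift)
```

With no enhancement applied, `rebuilt` matches the covered portion of `signal` up to rounding error, which is a convenient sanity check before inserting a gain rule between analysis and synthesis.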
Conclusion
In this paper, we have investigated the spectral energy estimator for use in robust ASR. Traditionally, the spectral energy estimator has suffered from the problem of residual noise. In order to improve the SE estimator for use in robust ASR, we identified the causes of the residual noise. These problems were then addressed with a simple, heuristic-based SPU. Experimental results show a significant improvement in robustness over both the baseline results and the more common
References (23)
- Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: IEEE Internat....
- Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. (1979)
- Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. (1994)
- A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process. (1980)
- et al. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. (1984)
- et al. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. (1985)
- et al. Constrained iterative speech enhancement with application to speech recognition. IEEE Trans. Signal Process. (1991)
- et al. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. (1995)
- Fujimoto, M., Ariki, Y., 2000. Noisy speech recognition using noise reduction method based on Kalman filter. In: IEEE...
- Gales, L.F.J., 1995. Model-Based Techniques For Robust Speech Recognition. Ph.D. Thesis, University of Cambridge,...
- Automatic speech recognition with a modified Ephraim–Malah rule. IEEE Signal Process. Lett.