Speech Communication
Volume 48, Issue 11, November 2006, Pages 1486-1501

Binary and ratio time-frequency masks for robust speech recognition

https://doi.org/10.1016/j.specom.2006.09.003

Abstract

A time-varying Wiener filter specifies the ratio of target signal energy to noisy mixture energy in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing-data recognizer that operates in the spectral domain using the time-frequency units that are dominated by speech. To apply the missing-data recognizer, the same binaural processor is used to estimate an ideal binary time-frequency mask, which selects a local time-frequency unit if the speech signal within the unit is stronger than the interference. We find that the missing-data recognizer performs better on a small vocabulary recognition task, but the conventional recognizer performs substantially better when the vocabulary size is increased.

Introduction

The performance of automatic speech recognizers (ASRs) degrades rapidly in the presence of noise, microphone variations and room reverberation (Gong, 1995, Lippmann, 1997). Speech recognizers are typically trained on clean speech and face a problem of mismatch when used in conditions where speech occurs simultaneously with other sound sources. To mitigate the effect of this mismatch on recognition, noisy speech is typically preprocessed by speech enhancement algorithms, such as microphone arrays (Brandstein and Ward, 2001, Cardoso, 1998, Ehlers and Schuster, 1997, Hughes et al., 1999), computational auditory scene analysis (CASA) systems (Brown and Wang, 2005, Rosenthal and Okuno, 1998), or spectral subtraction techniques (Boll, 1979, Droppo et al., 2002). Microphone arrays require the number of sensors to increase as the number of interfering sources increases. Monaural CASA systems employ harmonicity as the primary cue for grouping acoustic components corresponding to speech; these systems, however, do not perform well in time-frequency (T-F) regions that are dominated by unvoiced speech. Spectral subtraction systems typically assume stationary noise, so in the presence of non-stationary noise sources their performance is not adequate for recognition (Cooke et al., 2001). If samples of the corrupting noise source are available a priori, a model for the noise source can additionally be trained, and noisy speech may then be jointly decoded using the trained models of speech and noise (Gales and Young, 1996, Varga and Moore, 1990) or enhanced using linear filtering methods (Ephraim, 1992). However, in many realistic applications adequate amounts of noise samples are not available a priori, and hence training a noise model is not feasible.

Recently, a missing-data approach to speech recognition in noisy environments has been proposed by Cooke et al. (2001). This method is based on distinguishing between reliable and unreliable data. When speech is contaminated by additive noise, some time-frequency units contain predominantly speech energy (reliable) while the rest are dominated by noise energy. The missing-data method treats the latter T-F units as missing or unreliable during recognition (see Section 4.2). Unreliable T-F units are identified by thresholding the local SNR, which is typically estimated using spectral subtraction. The performance of the missing-data recognizer is significantly better than that of a system using spectral subtraction for speech enhancement followed by recognition of the enhanced speech (Cooke et al., 2001).
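
To make the reliability criterion concrete, the sketch below labels T-F units by thresholding an estimated local SNR. The spectral-subtraction style noise estimate and the 0 dB threshold are illustrative assumptions, not the exact settings of Cooke et al. (2001).

```python
import numpy as np

def reliability_mask(noisy_power, noise_estimate, threshold_db=0.0):
    """Label each T-F unit reliable (1) or unreliable (0).

    noisy_power:    |Y(t,f)|^2, power spectrogram of the noisy speech
    noise_estimate: estimated noise power per unit (e.g. a noise floor
                    tracked during speech pauses, as in spectral subtraction)
    threshold_db:   local SNR threshold; 0 dB marks units where the
                    estimated speech energy at least equals the noise.
    """
    eps = 1e-12
    speech_estimate = np.maximum(noisy_power - noise_estimate, eps)
    local_snr_db = 10.0 * np.log10(speech_estimate / (noise_estimate + eps))
    return (local_snr_db > threshold_db).astype(np.float32)
```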

A potential disadvantage of the missing-data recognizer is that recognition is performed in the spectral or T-F domain. It is well known that, under clean speech conditions, recognition using cepstral coefficients yields superior performance compared to recognition using spectral coefficients (Davis and Mermelstein, 1980). The superiority of cepstral features stems from the ability of the cepstral transformation to separate vocal-tract filtering from the excitation source in speech production (Rabiner and Juang, 1993). Additionally, the cepstral transform approximately orthogonalizes the spectral features (Shire, 2000). Since missing-data recognition is based on marginalizing the unreliable T-F features during recognition, it is tied to a spectral or T-F representation. Any global transformation of the spectral features (e.g. the cepstral transformation) smears the information from the noisy T-F units across all the resulting features, preventing effective marginalization. Attempts to adapt the missing-data method to the cepstral domain have centered on reconstruction, or imputation, of the missing values in the spectral domain followed by transformation to the cepstral domain (Cooke et al., 2001, Raj et al., 2004). Alternatively, van Hamme (2003) performs imputation directly in the cepstral domain. These reconstructions are typically based either on the speech recognizer itself or on other trained models of speech. The success of such model-based imputation techniques depends on whether the reliable data are adequate for identifying the correct speech model for imputation. In addition, errors in the imputation procedure affect the performance of the system even when the model is correctly identified.
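
The smearing effect can be seen directly from the discrete cosine transform that maps log-spectral channels to cepstral coefficients: every coefficient is a weighted sum over all channels, so corrupting a single channel perturbs all of them. A minimal numerical illustration with hypothetical values:

```python
import numpy as np
from scipy.fftpack import dct

# 23 hypothetical log-mel channels of a clean frame
log_spec = np.linspace(1.0, 3.0, 23)
cepstra_clean = dct(log_spec, type=2, norm='ortho')

# corrupt a single channel, as a noise-dominated T-F unit would
log_spec_noisy = log_spec.copy()
log_spec_noisy[10] += 5.0
cepstra_noisy = dct(log_spec_noisy, type=2, norm='ortho')

# all 23 cepstral coefficients change, so the corruption cannot be
# confined to a subset of features and marginalized away
print(np.sum(np.abs(cepstra_noisy - cepstra_clean) > 1e-6))  # -> 23
```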

Another potential drawback of the missing-data recognizer, which has not been well studied, is the problem of data paucity. The amount of “reliable” data available to the recognizer is a function of both the SNR and the frequency characteristics of the noise source. A decrease in SNR, as well as an increase in the bandwidth of the noise source, increases the amount of missing data. This leads to a deterioration in performance for a small vocabulary task (Cooke et al., 2001). The reduction in reliable data may pose an additional problem for recognition with larger vocabulary sizes. Paucity of reliable data constrains the missing-data recognizer to use only a small portion of the total T-F acoustic model space, and this reduced space may be insufficient to differentiate between a large number of competing hypotheses during decoding. In this paper, we study this issue by comparing the performance of the missing-data recognizer on two tasks with different vocabulary sizes.

Binaural CASA systems that compute binary masks have been used successfully as front-ends for the missing-data recognizer on small vocabulary tasks (Palomaki et al., 2004, Roman et al., 2003). Such systems compare the acoustic signals at the two ears in order to extract the binaural cues of interaural time differences (ITD) and interaural intensity differences (IID). These binaural cues are correlated with the location of a sound source and hence provide powerful mechanisms for segregating sound sources from different locations. Moreover, binaural processing is independent of the signal content and hence can be used to segregate both voiced and unvoiced speech components from a noisy mixture. The computational goal of the binaural CASA systems is an ideal binary mask. A T-F unit in the ideal binary mask is labeled 1 or reliable if the corresponding T-F unit of the noisy speech contains more speech energy than interference energy; it is labeled 0 or unreliable otherwise. We employ a recent binaural speech segregation system (Roman et al., 2003) to estimate an ideal binary T-F mask. This mask is fed to the missing-data recognizer and recognition is performed in the spectral domain.
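
Given the premixing signals, the ideal binary mask defined above reduces to a per-unit energy comparison; a minimal sketch, where the array names are hypothetical:

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power):
    """1 (reliable) where a T-F unit holds more speech than interference energy."""
    return (speech_power > noise_power).astype(np.float32)
```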

The minimum mean-square error (MMSE) based short-time spectral amplitude estimator, which utilizes the a priori SNR in a local T-F unit, has been used previously to effectively enhance noisy speech (Ephraim and Malah, 1984). The a priori SNR can be obtained if the premixing speech and noise signals are available. Roman et al. (2003) have shown that in a narrow frequency band there exists a systematic relationship between the a priori SNR and the values of ITD and IID. Motivated by this observation, we estimate an ideal ratio T-F mask using statistics collected for ITD and IID at each individual frequency bin. A unit in the ratio mask is a measure of the ratio of speech energy to total energy (speech plus noise) in the corresponding T-F unit of the noisy signal. The ratio mask is then used to enhance the speech, enabling recognition using Mel-frequency cepstral coefficients (MFCCs). We use “conventional recognizer” to refer to a continuous density hidden Markov model (HMM) based ASR using MFCCs as features.
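
The ideal ratio mask described above assigns each T-F unit the fraction of its energy contributed by speech. A minimal sketch computed from the premixing signals follows; in the actual system this quantity is not available and is instead estimated from ITD and IID statistics, as described in Section 3.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Speech energy over total (speech + noise) energy in each T-F unit."""
    return speech_power / (speech_power + noise_power + eps)
```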

We compare the performance of the conventional recognizer to that of the missing-data recognizer on a robust speech recognition task. In particular, we examine the effect of vocabulary size on the performance of the two recognizers. We find that on a small vocabulary task, the missing-data recognizer outperforms the conventional ASR. Our finding is consistent with a previous comparison using a binaural front-end on a small vocabulary “cocktail-party” recognition task (Glotin et al., 1999, Tessier et al., 1999), in which the accuracy obtained using the missing-data method in the spectral domain was reported to be better than that obtained using a conventional ASR in the cepstral domain. With an increase in the vocabulary size, however, the conventional ASR performs substantially better. Results using missing-value imputation methods have previously been reported on a larger vocabulary task (Raj et al., 2004); their method uses a binary mask and is therefore subject to the same limitations stated above.

The rest of the paper is organized as follows. Section 2 provides an overview of the proposed systems. We then describe the binaural front-end for both the conventional and missing-data recognizers in Section 3; that section also provides details on the estimation of the ideal binary and ratio T-F masks. The conventional and missing-data recognition methods are reviewed in Section 4. The recognizers are tested on two task domains with different vocabulary sizes; Section 5 describes the two tasks and presents the evaluation results of the recognizers along with a comparison of their relative performance. Finally, conclusions and future work are given in Section 6.

Section snippets

System overview

In this study, we analyze two strategies for robust speech recognition: (1) missing-data recognition and (2) a system that combines speech enhancement with a conventional ASR. The performance is examined at various SNR conditions and for two vocabulary sizes. Fig. 1 shows the architecture of the two different processing strategies.

The input to both systems is a binaural mixture of speech and interference presented at different, but fixed, locations. The measurements of head-related transfer

A localization based front-end for ASR

When speech and additive noise are orthogonal, the linear MMSE filter is the Wiener filter (van Trees, 1968). With frame-based processing, the MMSE filter corresponds to the ratio of the speech eigenvalues to the sum of the eigenvalues of speech and noise (van Trees, 1968). The eigenvalues can be computed from the auto-covariance functions prior to mixing by considering speech and noise to be two distinct random processes. Under asymptotic conditions, the MMSE filter corresponds to the frame-based
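
For reference, when the local speech and noise powers $P_s(t,f)$ and $P_n(t,f)$ are known, the time-varying Wiener gain that the ratio mask approximates takes the standard form

$$H(t,f) = \frac{P_s(t,f)}{P_s(t,f) + P_n(t,f)} = \frac{\mathrm{SNR}(t,f)}{1 + \mathrm{SNR}(t,f)},$$

where $\mathrm{SNR}(t,f) = P_s(t,f)/P_n(t,f)$ is the a priori SNR of the T-F unit; the symbols here follow common usage rather than the paper's notation.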

Recognition strategies

We evaluate the binaural segregation system described in Section 3 as the front-end for robust ASR using two different recognizers. Conventional ASR uses MFCCs as the parameterization of observed speech. MFCCs are computed from the segregated speech obtained after applying the ratio mask to the noisy input signal (see Eq. (5)). The missing-data recognizer uses log-spectral energy as feature vectors in conjunction with the binary mask, generated by the binaural system. An HMM toolkit, HTK (Young
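
A rough sketch of the conventional front-end path is given below, assuming a mel power spectrogram of the noisy input and a ratio mask of the same shape; the DCT-based cepstra stand in for the HTK MFCC computation and for Eq. (5), whose exact form is not reproduced here.

```python
import numpy as np
from scipy.fftpack import dct

def masked_cepstra(noisy_mel_power, ratio_mask, n_ceps=13, eps=1e-12):
    """Apply the ratio mask to the noisy mel spectrogram and take the
    DCT of the log energies to obtain MFCC-like features per frame."""
    enhanced = ratio_mask * noisy_mel_power      # element-wise enhancement
    log_mel = np.log(enhanced + eps)             # log compression
    # DCT along the channel axis; keep the first n_ceps coefficients
    return dct(log_mel, type=2, norm='ortho', axis=-1)[..., :n_ceps]
```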

Evaluation results

To compare the effect of vocabulary size on the two recognition approaches outlined above, we choose two task domains. The first task is speaker-independent recognition of connected digits; the grammar for this task allows for the repetition of one or more digits. This is the same task used in the original study of Cooke et al. (2001). Thirteen word-level models (the digits 1–9, zero and oh, plus a silence model and a very short inter-word pause model) are trained for both recognizers. All except the short pause model have

Discussion

The advantage of the missing-data recognizer is that it imposes a lesser demand on the speech enhancement front-end than the conventional ASR. Only knowledge of reliable T-F units of noisy speech, or an ideal binary mask, is required from the front-end. Moreover, Roman et al. (2003) have shown that the performance of the missing-data recognizer degrades gradually with increasing deviation from the ideal binary mask. The binaural system employed here is able to estimate this mask accurately.

Acknowledgments

This research was supported in part by an AFOSR Grant (FA9550-04-1-0117) and an NSF Grant (IIS-0081058). We thank M. Cooke for discussion and assistance in implementing the missing-data recognizer, and D. Pearce for helping us obtain the ETSI advanced front-end algorithm. We also thank the three anonymous reviewers for their suggestions and criticisms. A preliminary version of this work was presented at ICSLP 2004.

References

  • Cardoso, J.F., 1998. Blind signal separation: statistical principles. Proc. IEEE.
  • Chen, C.-P., Bilmes, J., Ellis, D.P.W., 2005. Speech feature smoothing for robust ASR. In: Proc. IEEE International...
  • Cole, R., Noel, M., Lander, T., Durham, T., 1995. New telephone speech corpora at CSLU. In: Proc. European Conference...
  • Cunningham, S., Cooke, M., 1999. The role of evidence and counter-evidence in speech perception. In: Proc....
  • Davis, S.B., et al., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Processing.
  • de Veth, J., de Wet, F., Cranen, B., Boves, L., 1999. Missing feature theory in ASR: make sure you miss the right type...
  • Droppo, J., Acero, A., Deng, L., 2002. A nonlinear observation model for removing noise from corrupted speech log...
  • Ehlers, F., et al., 1997. Blind separation of convolutive mixtures and an application in automatic speech recognition in a noisy environment. IEEE Trans. Signal Processing.
  • Ephraim, Y., 1992. A Bayesian estimation approach for speech enhancement using hidden Markov models. IEEE Trans. Signal Processing.
  • Ephraim, Y., et al., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Processing.
  • Gales, M.J.F., et al., 1996. Robust continuous speech recognition using parallel model combination. IEEE Trans. Speech Audio Processing.