Techniques for handling convolutional distortion with 'missing data' automatic speech recognition

https://doi.org/10.1016/j.specom.2004.02.005

Abstract

In this study we describe two techniques for handling convolutional distortion with 'missing data' speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach for spectrally distorted and noise-contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify 'reliable' regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram.

Introduction

Although much research effort has been expended on the development of automatic speech recognition (ASR) systems, their performance still remains far from that of human listeners. In particular, human speech perception is robust when speech is corrupted by noise or by other environmental interference, such as reverberation or a poor transmission line (for example, see Assmann and Summerfield, 2003; Nabelek and Robinson, 1982). In contrast, ASR performance falls dramatically in such conditions (for a comparative review of human and automatic speech recognition performance in noise see Lippmann, 1997). As several researchers have observed (e.g., Cooke et al., 2001; Hermansky, 1998; Lippmann, 1997), the current limitations of ASR systems might reflect our limited understanding of human speech perception, and especially our inadequate technological replication of the underlying processes.

The robustness of human speech perception can be attributed to two main factors. First, listeners are able to segregate complex acoustic mixtures in order to extract a description of a target sound source (such as the voice of a speaker). Bregman (1990) describes this process as 'auditory scene analysis'. Second, human speech perception is robust even when speech is partly masked by noise, or when parts of the acoustic spectrum are removed altogether (for example, by a bandlimited communications channel). Cooke et al. (2001) have interpreted this ability in terms of a 'missing data' model of speech recognition, and have adapted a hidden Markov model (HMM) classifier to deal with missing or unreliable features. In their system, a time-frequency 'mask' is employed to indicate whether acoustic features are reliable or corrupted; according to this division, the features are treated differently by the recogniser. Typically, the missing data mask is derived from auditory-motivated processing, such as pitch analysis (Barker et al., 2001a; Brown et al., 2001) or binaural spatial processing (Palomäki et al., 2001; Palomäki et al., in press). Alternatively, the mask can be set according to local estimates of the signal-to-noise ratio (SNR) (Cooke et al., 2001).
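To make the mask-based likelihood computation concrete, the following minimal sketch (illustrative only, not the authors' implementation) scores one feature frame against a diagonal-covariance Gaussian state model by evaluating the likelihood over the reliable feature dimensions only; Cooke et al. (2001) additionally exploit the observed values as upper bounds on the masked components ('bounded marginalisation'), which is omitted here.

```python
import numpy as np

def marginal_log_likelihood(x, mask, mean, var):
    """Log-likelihood of frame x under a diagonal Gaussian, using only
    the components marked reliable in the mask.

    x, mean, var : (D,) arrays; mask : (D,) boolean, True = reliable.
    Minimal sketch of missing-data marginalisation; bounded
    marginalisation (Cooke et al., 2001) is omitted.
    """
    r = mask.astype(bool)
    diff = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var[r]) + diff ** 2 / var[r])

# Toy usage: a 4-channel frame in which channels 2 and 4 are masked.
x    = np.array([2.0, 9.0, 1.5, 7.0])    # observed spectral features
mask = np.array([True, False, True, False])
mean = np.array([2.1, 1.0, 1.4, 0.9])    # state mean (clean speech model)
var  = np.array([0.5, 0.5, 0.5, 0.5])
print(marginal_log_likelihood(x, mask, mean, var))
```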

The missing data paradigm was conceived by Cooke et al. as a means of dealing with additive noise in ASR. As a result, little consideration has been given to the ability of missing data ASR systems to handle interference caused by the interaction of a target sound with its environment (such as a transmission line, audio equipment or reverberant space). In terms of signal theory this is regarded as convolutional interference. In this paper, we propose a number of modifications to a missing data ASR system which allow it to perform robustly in the presence of convolutional noise.

Convolutional interference can be characterised by the impulse response of the corresponding system. If the impulse response is short compared to the analysis window, then the interference mainly causes spectral alteration (see Avendano, 1997, Chapter 5). This follows because convolution in the time domain is equivalent to multiplication in the frequency domain (see Oppenheim and Schafer (1989) for a description of the convolution theorem of the Fourier transform). The analysis window used in speech processing is usually longer than 10 ms, which roughly corresponds to the pitch period of an average adult male voice. Examples of practical systems having short impulse responses are transmission lines, microphones and loudspeakers.
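The multiplicative effect of a short impulse response is easy to verify numerically. The sketch below (an illustration of the convolution theorem only, not of any system component) shows that, with sufficient zero-padding, the DFT of the convolved signal equals the product of the DFTs of the signal and the impulse response.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400)        # 50 ms of 'speech' at 8 kHz
h = rng.standard_normal(16) * 0.1   # 2 ms impulse response (e.g. a microphone)

y = np.convolve(x, h)               # time-domain convolution
n = len(x) + len(h) - 1             # zero-pad both to the full output length
Y = np.fft.rfft(y, n)
XH = np.fft.rfft(x, n) * np.fft.rfft(h, n)

# Convolution in time is multiplication in frequency:
print(np.allclose(Y, XH))           # True
```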

In the case of room reverberation the interaction is of a different nature, because the impulse response of a room is relatively long (from approximately 0.2 up to 5 s) compared to the window used for speech analysis. A typical room impulse response consists of sparse early reflections followed by dense late reverberation (higher-order reflections), which forms the exponentially decaying tail of the response. The sparse early reflections are highly correlated with the speech signal and often contribute usefully to speech intelligibility by increasing the loudness of the speech. However, early reflections can also cause some spectral deviation, due to comb filtering caused by successive reflections and the varying frequency characteristics of surface absorption. In contrast, the dense late reverberation is poorly correlated with the original speech signal and therefore behaves more like additive noise. Indeed, the ratio of early to late reverberation has successfully been used as a predictor of speech intelligibility in rooms (Bradley, 1986). It is common to define a critical delay time separating early from late reverberation, such that reflections arriving before the delay are beneficial to auditory perception whereas reflections arriving after it have a detrimental effect. The European norm ISO 3382 (1997) suggests critical delays of 50 ms for speech and 80 ms for music perception. Gölzer and Kleinschmidt (2003) investigated the role of early and late reflections in conjunction with ASR. They suggested that reflections have a beneficial effect on speech recognition accuracy up to a critical delay of 25–50 ms, assuming that late reverberation is strongly present in the room impulse response. Further details of the effect of room acoustics on speech intelligibility can be found in Bradley (1986) and Houtgast and Steeneken (1985).
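The early/late distinction can be quantified by an early-to-late energy ratio such as the clarity index C50. The sketch below (illustrative only) computes it for a toy impulse response; the 50 ms split point follows the critical delay for speech quoted above.

```python
import numpy as np

def clarity(rir, fs, split_ms=50.0):
    """Early-to-late energy ratio (clarity index, e.g. C50) in dB.

    rir      : room impulse response, assumed to start at the direct sound
    fs       : sampling rate in Hz
    split_ms : critical delay separating early reflections from late
               reverberation (50 ms for speech, per ISO 3382)
    """
    k = int(round(split_ms * 1e-3 * fs))
    early = np.sum(rir[:k] ** 2)
    late = np.sum(rir[k:] ** 2)
    return 10.0 * np.log10(early / late)

# Toy RIR: direct sound plus an exponentially decaying noise-like tail.
fs = 8000
t = np.arange(int(0.7 * fs)) / fs
rir = np.exp(-t / 0.1) * np.random.default_rng(1).standard_normal(t.size)
rir[0] = 1.0
print(f"C50 = {clarity(rir, fs):.1f} dB")
```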

The conventional way of tackling convolutional interference in ASR has been to use cepstral encoding, and to employ cepstral mean subtraction to remove the spectral distortion. Two common examples of cepstral encoding are mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and cepstral features obtained by perceptual linear prediction (PLP) (Hermansky, 1990). Interestingly, both of these approaches are loosely based on known mechanisms of auditory frequency encoding. However, they have been found to perform inadequately with reverberated speech (Kingsbury, 1998; Kingsbury et al., 1998). Reverberation can also be handled via blind source separation (BSS) using a microphone array, or via blind deconvolution or dereverberation (for an overview see Omologo et al., 1998). In such approaches, the aim is to enhance subjective speech quality rather than to find a robust acoustic encoding. BSS gives good dereverberation performance, but at least two microphone signals are needed to process a single speech source (for an overview of BSS and independent component analysis see Hyvärinen et al., 2001).
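As a reminder of why cepstral mean subtraction removes a stationary channel: convolution becomes multiplication in the spectral domain and hence an additive constant in the log-spectral (cepstral) domain, so subtracting the per-utterance mean cancels it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.

    cepstra : (T, D) array of cepstral frames (e.g. MFCC or PLP cepstra).
    A time-invariant channel h adds a constant log|H| to every frame in
    the log-spectral (hence cepstral) domain, so subtracting the
    utterance mean cancels it, along with the long-term speech average.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```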

Kingsbury and his colleagues (Kingsbury, 1998; Kingsbury et al., 1998) have reported that a modulation-filtered spectral representation, the modulation spectrogram (MSG), can improve ASR performance with reverberated speech. Spectral bands are processed by a modulation filter, which emphasises the strongest speech modulations and effectively removes reverberant or noisy regions that are not modulated in the same way as speech signals. This approach is consistent with studies that demonstrate the importance of low-frequency modulations in human speech recognition (Houtgast and Steeneken, 1985; Drullman et al., 1994) and in ASR (Kanedera et al., 1999).
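The sketch below illustrates the general idea of modulation filtering: band-pass filtering the temporal envelope of each spectral channel at syllable-rate modulation frequencies. The 2–16 Hz pass-band is an illustrative choice and does not reproduce Kingsbury's exact MSG filter design.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_filter(ratemap, frame_rate=100.0, band=(2.0, 16.0)):
    """Band-pass filter the temporal envelope of each spectral channel.

    ratemap    : (T, F) spectro-temporal energy (frames x channels)
    frame_rate : frames per second (100 Hz for a 10 ms hop)
    band       : pass-band in Hz; speech modulation energy peaks near
                 4 Hz, roughly the syllable rate. (Illustrative choice,
                 not Kingsbury's exact MSG filters.)
    """
    nyq = frame_rate / 2.0
    b, a = butter(2, [band[0] / nyq, band[1] / nyq], btype="band")
    return filtfilt(b, a, ratemap, axis=0)  # zero-phase, per channel
```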

In this study we address the problem of handling convolutional distortion in a missing data ASR system which uses spectral speech features. Two conditions are considered: one in which speech is subject to spectral distortion and additive noise, and another in which speech is reverberated. In the first case, we derive a missing data mask from estimates of the SNR in local time-frequency regions, and employ spectral subtraction to remove the noise background. Furthermore, we introduce a new method for normalising spectral features that is compatible with the missing data ASR framework. In reverberant conditions, a modulation filtering scheme is used to generate the missing data mask. This approach exploits the temporal modulations of speech in order to find spectro-temporal regions which are not severely contaminated by reverberation.
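As a rough illustration of the first pathway, the sketch below performs simple spectral subtraction with a stationary noise estimate taken from the leading frames, and marks as reliable those time-frequency points whose local SNR exceeds a threshold; the paper's actual noise estimator and mask criterion are described in Section 3.

```python
import numpy as np

def snr_mask(ratemap, n_noise_frames=10, snr_db=0.0, floor=0.01):
    """Spectral subtraction and an SNR-based missing data mask.

    ratemap        : (T, F) spectro-temporal energy
    n_noise_frames : leading frames assumed to contain noise only
    snr_db         : local SNR threshold above which a point is 'reliable'
    Minimal sketch with a stationary noise estimate, not the paper's
    exact noise estimator.
    """
    noise = ratemap[:n_noise_frames].mean(axis=0) + 1e-12   # (F,) estimate
    clean = np.maximum(ratemap - noise, floor * noise)      # subtract, floor
    local_snr = 10.0 * np.log10(clean / noise)
    mask = local_snr > snr_db                               # True = reliable
    return clean, mask
```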

The current study extends our previous work in several important respects. A related scheme for spectral normalisation was presented in (Palomäki et al., in press), but it was applied only to a very specific purpose (speech recognition using a binaural hearing model). Here, we develop the normalisation scheme more thoroughly, and evaluate it on a more general speech recognition task with different types of spectral distortion. Our early work on modulation mask estimation (Palomäki et al., 2002) suffered from the drawback that the algorithm needed to be hand-tuned for each reverberation condition. This problem has now been addressed by an adaptive scheme, in which the parameters of the algorithm are set according to an estimate of the degree of reverberation present in the signal. This allows the same system to be used in a wide range of reverberation conditions without the need for hand-tuning. Finally, in (Palomäki et al., 2002) the system was evaluated on a limited number of simulated room impulse responses (RIRs), whereas here we use real RIRs whose T60 reverberation times vary between 0.7 and 1.5 s. The results obtained with our new method are also compared against the recogniser of Kingsbury (1998) for reverberated speech, which uses MSG and PLP features.

Section 2 of the paper describes the overall architecture of the missing data ASR system and the acoustic features used. In Section 3, we present a processing pathway that is optimised for conditions in which speech is subject to spectral distortion and additive noise. A processing pathway for reverberant conditions is described in Section 4. The system is evaluated under a number of noise conditions in Section 5, and compared against a baseline approach. We conclude with a discussion in Section 6.

Section snippets

Speech recogniser

The missing data speech recognition system is shown schematically in Fig. 1. In this section we describe the front-end processing, which extracts spectral features using an auditory model, and explain the missing data ASR approach.

Processing for spectral distortion and additive noise

In this section we describe a processing pathway that compensates for spectral distortion and additive noise. Our approach combines three techniques: estimation of a missing data mask on the basis of the SNR in local time-frequency regions (Section 3.1), spectral subtraction (Section 3.2), and an approach to spectral feature normalisation which is suitable for missing data ASR in the presence of additive noise (Section 3.3).

Processing for reverberation

This section describes a processing pathway for missing data ASR in reverberant conditions (see Fig. 1). In the first stage, modulation filtering is used to derive a mask that identifies the speech features that are least contaminated by reverberation. Following this, spectral features are normalised using a modification of the technique described in Section 3.3.
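A crude rendering of this idea, building on the modulation_filter sketch given in the Introduction, thresholds the modulation-filtered energy to obtain a binary mask. Note that the fixed threshold theta is hypothetical: the method described in Section 4 instead sets its parameters adaptively from an estimate of the amount of reverberation.

```python
import numpy as np
# Requires modulation_filter from the earlier sketch.

def modulation_mask(ratemap, theta=0.5, **kwargs):
    """Mark as reliable the points whose modulation-filtered energy is
    high relative to the per-channel maximum.

    Crude sketch of the idea only: `theta` is a hypothetical fixed
    threshold, whereas the paper derives its parameters adaptively
    from an estimate of the degree of reverberation.
    """
    filtered = modulation_filter(ratemap, **kwargs)   # (T, F) envelopes
    strength = np.maximum(filtered, 0.0)              # keep rising energy
    ref = strength.max(axis=0, keepdims=True) + 1e-12 # per-channel max
    return (strength / ref) > theta                   # True = reliable
```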

Corpus and recogniser configuration

The missing data ASR system was evaluated using a subset of the Aurora 2.0 connected English language digits recognition task (Pearce and Hirsch, 2000). The sampling rate of all speech data was 8 kHz. Auditory rate maps were obtained for the clean training section of the Aurora corpus, and were used to train 12 word-level HMMs (see Section 2.2). The training section contained 8440 clean (noiseless) utterances. For the evaluation of the missing data system we trained two kinds of recognisers,

Discussion

In this paper we have described techniques for handling convolutional distortion in 'missing data' speech recognition, an issue which has been largely unaddressed to date. As convolutional interference can be quite different in nature depending upon the length of the impulse response concerned, we propose two approaches: one to handle spectral distortion due to a transmission line or audio equipment, and another to handle room reverberation. In summary, the results show

Acknowledgements

We thank two anonymous reviewers for their incisive comments. KJP was funded by the EC TMR SPHEAR project, the Academy of Finland (project number 1277811) and was partially supported by a Finnish Nokia säätiö grant. GJB was funded by EPSRC grant GR/R47400/01. The authors owe many thanks to Dan Ellis, Brian Kingsbury and Heidi Christensen for their kind help with implementing the MSG + PLP baseline system. Dan Ellis and Brian Kingsbury also made some of the real room impulse responses available to

References (59)

  • Atal, B.S., 1974. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am.
  • Barker, J., et al., 2000. Decoding speech in the presence of other sound sources. Proc. ICSLP-2000.
  • Barker, J., et al., 2000. Soft decisions in missing data techniques for robust automatic speech recognition. Proc. ICSLP-2000.
  • Barker, J., et al., 2001a. Robust ASR based on clean speech models: an evaluation of missing data techniques for connected digit recognition in noise. Proc. Eurospeech-2001.
  • Barker, J., Green, P.D., Cooke, M.P., 2001b. Linking auditory scene analysis and robust ASR by missing data techniques, ...
  • Bradley, J.S., 1986. Predictors of speech intelligibility in rooms. J. Acoust. Soc. Am.
  • Bregman, A.S., 1990. Auditory Scene Analysis.
  • Brown, G.J., et al., 2001. A neural oscillator sound separator for missing data speech recognition. Proc. IJCNN-2001.
  • Cole, R.A., et al., 1995. New telephone speech corpora at CSLU. Proc. Eurospeech-1995.
  • Cooke, M.P., 1993. Modelling Auditory Processing and Organization.
  • Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
  • Droppo, J., et al., 2002. Uncertainty decoding with SPLICE for noise robust speech recognition. Proc. ICASSP-2002.
  • Drullman, R., et al., 1994. Effects of temporal envelope smearing on speech reception. J. Acoust. Soc. Am.
  • Dupont, S., Ris, C., 1999. Assessing local noise level estimation methods. In: Proc. of Workshop on Robust Methods for ...
  • Eronen, A., et al., 2003. Audio context awareness: acoustic modeling and perceptual evaluation. Proc. ICASSP-2003.
  • Gold, B., et al., 2000. Speech and Audio Signal Processing.
  • Gölzer, H., Kleinschmidt, M., 2003. Importance of early and late reflections for automatic speech recognition in reverberant environments. Proc. Elektronische Sprachsignalverarbeitung (ESSV).
  • Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am.
  • Hermansky, H., et al., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.