Speech Communication

Volume 22, Issue 4, September 1997, Pages 315-332
Filtering the time sequences of spectral parameters for speech recognition

https://doi.org/10.1016/S0167-6393(97)00030-7

Abstract

In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the frame-to-frame temporal evolution of the spectral envelope. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, using both whole-word and subword-based models. The empirically optimum filters attenuate the low-frequency band and emphasize a higher band so that the peak of the average long-term spectrum of the output of these filters lies at around the average syllable rate of the employed database (≈3 Hz).


Introduction

The first step in the pattern matching approach to the problem of speech recognition is to convert a speech waveform into a sequence of features, usually in the form of spectral parameters (Rabiner and Juang, 1993). Speech signals are usually modeled as the output of a time-varying filter driven by a signal whose spectrum is essentially either flat or a train of spectral lines of equal power. Consequently, on a short-time basis, the envelope of the speech spectrum represents the instantaneous spectral response of the filter whose characteristics are the determining factor of the identity of a speech sound or a speech utterance. Conventionally, speech spectral envelopes are represented by means of all-pole models or various forms of periodogram-based estimators, and often are expressed in terms of the corresponding cepstral coefficients (Picone, 1993).

These representations are calculated via short-time spectral analysis. Let the sampled speech signal be s(l). A window function w(l) is applied to it at regular intervals nN0, n=…,−1,0,1,2,…, to form frames of windowed signal s(l)w(nN0−l). The window function is usually of finite duration L0. Spectral analysis techniques are then used to obtain a short-time spectral estimate for each signal frame, which is represented with Q parameters (Q may be the order of the all-pole model, or the number of frequency bands of the periodogram-based estimators). In the general case, the set of parameters of each frame is transformed into a new representation (e.g., cepstral coefficients) that is better adapted to the speech classifier, which will use the spectral information to decide which speech unit or utterance has been said. Thus, this signal modelling process results in a set of time sequences of spectral parameters that represent the temporal evolution of the spectral response of the time-varying filter. We shall refer to each time sequence of spectral parameters as a TSSP, and we will hereafter assume that the spectral parameters are the common cepstral coefficients, although most of the derivations and results would also be valid for parameters in the logarithmic spectral domain or any linear transformation of them.
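As a concrete illustration (not the paper's exact front end), the framing and cepstral analysis described above can be sketched in a few lines of Python; the Hamming window, the frame shift N0 = 80 samples, the length L0 = 200 and the order Q = 12 are arbitrary choices made for the example:

```python
import numpy as np

def tssp(signal, frame_shift=80, frame_len=200, Q=12):
    """Frame the signal, window it, and compute Q cepstral
    coefficients per frame.  Returns a (Q, n_frames) array whose
    rows are the time sequences of spectral parameters (TSSPs)."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    ceps = np.empty((Q, n_frames))
    for n in range(n_frames):
        frame = signal[n * frame_shift:n * frame_shift + frame_len] * w
        # Periodogram-based log spectral estimate (small floor avoids log 0)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-10)
        # Real cepstrum: inverse Fourier transform of the log spectrum
        c = np.fft.irfft(log_spec)
        ceps[:, n] = c[1:Q + 1]   # drop c0 (overall gain/energy)
    return ceps
```

Each row `ceps[m, :]` is then one TSSP c_m(n), the object that the rest of the paper filters and analyzes.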

There are certain inherent limitations in this type of speech signal representation. First, spectral estimation based on finite data involves a certain random estimation error. Moreover, in speech spectral estimation, the relative positioning of each frame with respect to pitch periods introduces an additional estimation error. There is a tradeoff between the variance or the power of these errors and the time resolution of the spectral estimator that is mainly controlled by the window length L0 (Nadeu and Juang, 1994). A similar tradeoff exists between estimation error and frequency resolution. For a given L0, the number of spectral parameters Q determines that tradeoff for each estimator.

As the object of the present work is the time evolution of the speech spectral representations for speech recognition, we are interested in the shortcomings of the estimators concerning time resolution. The temporal resolution should be high enough to allow an accurate tracking of the time-varying filter characteristics, but this becomes difficult in fast transitions of speech signals. Additionally, the rigid frame-to-frame working mode does not make it easy to model the inherent dynamics of the speech signal. This problem is compounded by the independence of observations assumed in the usual hidden Markov models since it implies that each set of parameters is uncorrelated with those of the surrounding frames, except through the Markov chain.

Apart from the error due to the above limitations of the spectral estimation process, every TSSP carries more information from the speech signal than its mere phonetic content, such as speaker characteristics, acoustic distortion and noise. As these factors are sources of recognition errors, a suitable speech signal model should be robust to them.

In recent years, filtering of the TSSP has been extensively employed as a simple way of attempting to partially overcome these temporal limitations (see Hanson et al. (1996) for a survey of filtering techniques). Both dynamic features (Furui, 1986) and RASTA-type processing (Hermansky and Morgan, 1994; Hirsch et al., 1991) use linear filters to obtain more robust and more discriminative speech representations. Thus, the usual signal modelling process preceding pattern matching is as shown in Fig. 1.

Filtering is convenient to remove, from the logarithmic spectral parameters, the slowly varying linear distortion due to the microphone, telephone channel, etc. that is present in the speech signal. This fact is easily understandable from a frequency analysis point of view. However, despite their generalized usage, few attempts have been made to use frequency analysis to gain insight into the characteristics of the filtered parameters that are employed as dynamic features. Explanations of the excellent performance of these supplementary filtered parameters are usually based on the idea of successive smoothed derivatives that capture the temporal change of the spectral parameters.

In this paper, we will try to obtain a better understanding of parameter filtering by resorting to frequency analysis and linear filter theory, and by making use of the long-term spectrum of the TSSP. The frequency variable of the spectrum, which is the Fourier counterpart of the frame index n, has been called modulation frequency in a subband analysis framework (Houtgast and Steeneken, 1985), since it corresponds to the envelope variation rate, and also in a general sense, to describe the rate of change of any spectral parameter representation (Hanson et al., 1996). If the frame shift equals 10 ms, there are spectral components at frequencies up to 50 Hz, half of the analysis frame rate. In principle, the modulation frequency could actually play a meaningful role in speech recognition, since statistical measures defined on it have been associated with speech intelligibility in several human auditory perception studies. We will return to this point in Section 7.1.

The present work started from an initial observation: whereas the passbands of the frequency responses of the various filters employed so far for filtering the TSSP in similar recognition tasks are quite diverse, the high-power bands of the TSSP spectra of the filtered sequences show a noticeable similarity. In other words, the long-term spectrum of the TSSP decays along the modulation frequency and the various filters have in common a rising slope which equalizes that decaying spectral curve in a certain band. Such an observation led to a series of discussions and recognition experiments whose results are reported in this paper.

This paper is organized as follows. The long-term spectrum of the TSSP is presented in Section 2. The spectral effects of the various types of filters reported in the literature are explained in Section 3, and they are related to the HMM formalism in Section 4. After proposing in Section 5 a new filtering scheme that is based on the above observations, some recognition experiments are reported in Section 6 in order to validate the spectral approach. After they are discussed in Section 7, some questions arise which are tackled in the next sections. A certain reduction of speaker variability performed by the filter is shown in Section 8. The conventional cepstral mean subtraction technique is interpreted in terms of filtering in Section 9, and this permits us to discuss the role of the filter length and the dependence on the speaking rate. Finally, in Section 10, filtering is applied to short (subword) units in order to make apparent the effect of filtering on the unit boundaries with and without context modelling.


Spectrum of the TSSP

Let log S(ω,n) be the short-time log spectral estimate of the speech signal, with n denoting the frame index and ω the frequency. We shall use the cepstrum c_m(n) as the representation of log S(ω,n), i.e.,
$$c_m(n)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\log S(\omega,n)\,e^{j\omega m}\,d\omega,$$
due to its widespread use in speech recognition applications. Note that the Fourier transform of the time sequence of the mth cepstral coefficient c_m(n) is
$$C_m(\theta)=\sum_{n}c_m(n)\,e^{-j\theta n},$$
where the modulation frequency variable θ is the Fourier counterpart of the frame index n.
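As an illustration, C_m(θ) and an averaged long-term spectrum can be approximated with the FFT; the periodogram-averaging estimator and the helper name `long_term_spectrum` below are assumptions made for this sketch, not the paper's exact procedure:

```python
import numpy as np

def long_term_spectrum(trajectories, nfft=256):
    """Rough estimate of the average long-term modulation spectrum
    of one cepstral trajectory c_m(n): the mean periodogram
    |C_m(theta)|^2 over a collection of sequences (e.g. utterances)."""
    acc = np.zeros(nfft // 2 + 1)
    for c in trajectories:
        acc += np.abs(np.fft.rfft(c, nfft)) ** 2 / len(c)
    return acc / len(trajectories)

# With a 10 ms frame shift the frame rate is 100 Hz, so the
# modulation-frequency axis runs from 0 up to the 50 Hz Nyquist limit:
mod_freqs = np.fft.rfftfreq(256, d=0.01)   # d = frame shift in seconds
```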

Let us

Filtering of the TSSP

Dynamic features of speech in the form of differential parameters are extensively employed in speech and speaker recognition systems. The differential parameters are usually analyzed in the time domain, as successive derivatives that capture the change of the TSSP (Furui, 1986) (Taylor's expansion). However, they can also be envisioned as the output of a linear filter driven by the TSSP. In this sense, these parameters can be referred to as (time) filtered parameters.
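For instance, the standard regression delta can be written explicitly as an FIR filter with an anti-symmetric impulse response; the window half-length K = 2 and the edge-replication padding below are choices made for this sketch rather than prescriptions from the paper:

```python
import numpy as np

def delta(c, K=2):
    """Differential parameters as the output of an FIR filter: the
    regression delta with impulse response h(k) = -k / sum(k^2),
    k = -K..K.  The response is anti-symmetric, so the filter has a
    zero at z = 1 and removes the dc component of the TSSP."""
    k = np.arange(-K, K + 1)
    h = -k / np.sum(k ** 2)            # FIR impulse response
    c_pad = np.pad(c, K, mode='edge')  # replicate endpoints
    return np.convolve(c_pad, h, mode='valid')
```

On a linear ramp c(n) = n the interior output is the constant slope 1, and on a constant sequence it is zero, as expected of a smoothed derivative.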

Probably the most common

Band equalization and hidden Markov models

We have pointed out that most filters used so far have a zero at z=1 that performs an approximate equalization of the TSSP spectrum. Since the filter also shapes the spectrum of the equalized TSSP by enhancing a band which depends on the purpose of the corresponding filtered feature, either to complement the basic spectral feature or to substitute for it, we will refer to this effect as band equalization. It is illustrated in Fig. 6. Thus, all the modulation frequencies belonging to the
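A minimal sketch of the band-equalization idea, assuming a generic cascade (a first-order equalizer with its zero at z = 1, followed by a short smoothing filter) rather than any specific filter from the paper:

```python
import numpy as np
from scipy.signal import freqz

# The zero at z = 1 cancels dc and lifts the decaying TSSP spectrum,
# while the smoothing filter bounds the band of modulation
# frequencies that ends up enhanced.
equalizer = np.array([1.0, -1.0])      # H(z) = 1 - z^-1
smoother = np.hanning(9)
smoother /= smoother.sum()
h = np.convolve(equalizer, smoother)   # overall FIR response

# Frequency response on a modulation-frequency axis (100 Hz frame rate)
w, H = freqz(h, worN=1024, fs=100.0)
peak_hz = w[np.argmax(np.abs(H))]      # centre of the enhanced band
```

The response is zero at dc and peaks at a low modulation frequency, mimicking the combined equalization-plus-band-selection effect described above.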

Filter design

The distribution of the filtered parameter spectral bands along a frequency interval $0\leq\theta\leq\theta_c<\theta_s$ that leads to the best recognition results may depend on several factors: the number of supplementary features, the type of recognition task (e.g., IWR or CSR), the size of speech units, the speaking rate, the noise characteristics, etc. Consequently, the structure of the filters that compute either supplementary or substitutive parameters should be flexible enough to allow adaptation to these factors.
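One way such a flexible structure could be realized, assuming SciPy's DPSS (Slepian) windows and a cosine-modulated band-pass prototype; these are illustrative choices, not the paper's exact design method:

```python
import numpy as np
from scipy.signal.windows import dpss

def slepian_bandpass(length, center_hz, half_bw_hz, frame_rate=100.0):
    """Sketch of a Slepian-based band-pass FIR filter for TSSPs:
    the DPSS prototype is the finite sequence whose energy is most
    concentrated in a lowpass band of half-width half_bw_hz, and
    cosine modulation shifts that band to center_hz."""
    NW = length * half_bw_hz / frame_rate   # time-half-bandwidth product
    proto = dpss(length, NW)                # lowpass Slepian prototype
    n = np.arange(length) - (length - 1) / 2
    return proto * np.cos(2 * np.pi * center_hz * n / frame_rate)

# Cascade with the first-order equalizer (zero at z = 1):
h = np.convolve([1.0, -1.0], slepian_bandpass(21, 4.0, 5.0))
```

Changing the length, centre and bandwidth adapts the enhanced modulation band to the factors listed above, while the equalizer stage keeps the dc component cancelled.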

Experimental results

In order to validate the meaningfulness of the spectrum of the filtered TSSP, along with the usefulness of the alternative filter structure of Fig. 7(b) and the Slepian filters, we applied the above design method to two speaker-independent word recognition tasks. Tests were conducted using (1) only a filtered set of parameters (one feature), and (2) the unfiltered set and two supplementary filtered sets (three features).

Discussion

In the previous section, we have observed the effects of filtering the TSSP for CDHMM digit recognition when cepstral parameters are used. As was expected, an improvement in recognition rate was observed both by adding differential parameters and by removing the dc component, which is distorted by the telephone channel.

The above tests with differential parameters, which are obtained by FIR filters that consist of a cascade of a first-order equalizer and a Slepian filter, have made more apparent

Reducing the speaker variability

It is well known that the long-term spectrum of speech signals is influenced by the speaker's characteristics. Since long-term spectral characteristics are time-independent or slowly varying, they appear in the low-frequency region of the TSSP spectrum T(θ). So as to verify this, we have carried out a few variance measurements for the TI digit database used in this work. For this purpose, we have used all the single digit utterances of the adult portion of the database. Every utterance

Cepstral mean subtraction and speaking rate

Cepstral mean subtraction (CMS) is a widely used technique to cancel linear distortion in speech recognition. It eliminates the zero frequency component of every time sequence of cepstral coefficients by subtracting from each of its (frame) samples the average value in the utterance. So the whole utterance has to be available before performing CMS. A recognition rate increase by using CMS has rarely been reported in the case of clean speech. See (Haeb-Umbach et al., 1993) for a clear
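In matrix form, CMS is a one-line operation; this sketch assumes the TSSPs are stored as a (Q, n_frames) array:

```python
import numpy as np

def cms(ceps):
    """Cepstral mean subtraction: subtract from each cepstral
    trajectory its average over the utterance, cancelling the
    zero-modulation-frequency (dc) component of every TSSP.
    Note that the whole utterance must be available before the
    per-trajectory means can be computed."""
    return ceps - ceps.mean(axis=1, keepdims=True)
```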

Continuous subword-unit-based speech recognition

Simple IIR or FIR time filters, which significantly improve performance in isolated or connected word recognition tasks, induce spectral transition spreading and a cross-boundary effect. This effect is critical in continuous speech recognition, where phoneme-sized modelling units are used and filters may worsen recognition results (Hermansky and Morgan, 1994). In this section, we show how the use of context-dependent units reduces the side effects of the filters and may result in improved recognition

Conclusion

In this paper, we have attempted to obtain a better understanding of parameter filtering by resorting to frequency analysis. The analysis of the average long-term spectrum of filtered TSSPs revealed a band equalization effect that emphasizes certain modulation frequency bands. Experimental results showed how the use of properly filtered parameter sequences, with no supplementary parameters, results in improved recognition rates even for clean speech, both using whole-word and subword based

Acknowledgements

The authors wish to thank J.B. Mariño, J. Hernando, E. Lleida, R.C. Rose, F.K. Soong, C.-H. Lee and M. Rahim for their valuable suggestions and stimulating discussions. They would also like to express their gratitude to Manuel Toril for his assistance in the experimental work with the TI digit database. The work has been partly funded by the Spanish Government projects TIC95-0884-C04-02 and TIC95-1022-C05-03.

References (31)

  • Katagishi, K., et al., 1993. Feature extraction using a matrix coefficient filter for speech recognition. Speech Communication.
  • Lee, C.-H., et al., 1992. Improved acoustic modelling for large vocabulary CSR. Computer Speech and Language.
  • Wilpon, J., et al., 1993. Connected digit recognition based on improved acoustic resolution. Computer Speech and Language.
  • Applebaum, T.H., Hanson, B., 1990. Robust speaker-independent word recognition using spectral smoothing and temporal...
  • Arai, T., Pavel, M., Hermansky, H., Avendaño, C., 1996. Intelligibility of speech with filtered time trajectories of...
  • Avendaño, C., van Vuuren, S., Hermansky, H., 1996. Data based filter design for RASTA-like channel normalization in...
  • Bonafonte, A., Estany, R., Vives, E., 1995. Study of subword units for spanish speech recognition. Proc. Eurospeech'95,...
  • Drullman, R., et al., 1994. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer.
  • Drullman, R., et al., 1994. Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Amer.
  • Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Signal Process.
  • Greenberg, S., Kingsbury, B., 1997. The modulation spectrogram: in pursuit of an invariant representation of speech....
  • Haeb-Umbach, R., Geller, D., Ney, H., 1993. Improvements in connected digit recognition using linear discriminant...
  • Hanson, B.A., Applebaum, T.H., Junqua, J.C., 1996. Spectral dynamics for speech recognition under adverse conditions....
  • Hermansky, H., et al., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.
  • Hermansky, H., Avendaño, C., van Vuuren, S., Tibrewala, S., 1997. Recent advances in addressing sources of...
1. This paper is based on a communication presented at the ESCA Conference EUROSPEECH'95 and has been recommended by the EUROSPEECH'95 Scientific Committee.

2. The first part of this work was carried out while the first author was at the former AT&T Bell Laboratories on sabbatical leave.
