Filtering the time sequences of spectral parameters for speech recognition (footnotes 1, 2)
Introduction
The first step in the pattern matching approach to the problem of speech recognition is to convert a speech waveform into a sequence of features, usually in the form of spectral parameters (Rabiner and Juang, 1993). Speech signals are usually modeled as the output of a time-varying filter driven by a signal whose spectrum is essentially either flat or a train of spectral lines of equal power. Consequently, on a short-time basis, the envelope of the speech spectrum represents the instantaneous spectral response of the filter whose characteristics are the determining factor of the identity of a speech sound or a speech utterance. Conventionally, speech spectral envelopes are represented by means of all-pole models or various forms of periodogram-based estimators, and often are expressed in terms of the corresponding cepstral coefficients (Picone, 1993).
These representations are calculated via short-time spectral analysis. Let the sampled speech signal be s(l). A window function w(l) is applied to it at regular intervals nN0, n=…,−1,0,1,2,… to form frames of windowed signal s(l)w(nN0−l). The window function is usually of finite duration L0. Spectral analysis techniques are then used to obtain a short-time spectral estimate for each signal frame which is represented with Q parameters (Q may be the order of the all-pole model, or the number of frequency bands of the periodogram-based estimators). In the general case, the set of parameters of each frame is transformed into a new representation (e.g., cepstral coefficients) that is better adapted to the speech classifier, which will use the spectral information to decide which speech unit or utterance has been said. Thus, this signal modelling process results in a set of time sequences of spectral parameters that represent the temporal evolution of the spectral response of the time-varying filter. We shall refer to each time sequence of spectral parameters as TSSP, and we will hereafter assume that the spectral parameters are the common cepstral coefficients, although most of the derivations and results would also be valid for parameters in the logarithmic spectral domain or any linear transformation of them.
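The short-time analysis described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the front end used in the paper: the function name `cepstral_sequences`, the Hamming window, the FFT size and the use of a plain log periodogram (instead of an all-pole or filter-bank estimator) are all assumptions, as are the 8 kHz sampling rate and the 10 ms shift and 30 ms window in the usage lines.

```python
import numpy as np

def cepstral_sequences(s, frame_shift, win_len, num_ceps):
    """Return an array of shape (num_frames, num_ceps).

    Column m is the time sequence c_m(n) of the m-th cepstral coefficient,
    i.e. one TSSP. frame_shift plays the role of N0 and win_len of L0.
    """
    window = np.hamming(win_len)          # assumed window w(l); symmetric
    num_frames = 1 + (len(s) - win_len) // frame_shift
    nfft = int(2 ** np.ceil(np.log2(win_len)))
    tssp = np.empty((num_frames, num_ceps))
    for n in range(num_frames):
        start = n * frame_shift
        frame = s[start:start + win_len] * window
        # Log periodogram as a simple short-time log spectral estimate.
        log_spec = np.log(np.abs(np.fft.rfft(frame, nfft)) ** 2 + 1e-12)
        # Real cepstrum: inverse transform of the log spectrum.
        cepstrum = np.fft.irfft(log_spec, nfft)
        tssp[n] = cepstrum[:num_ceps]
    return tssp

# Usage with synthetic noise standing in for speech (assumed 8 kHz sampling).
rng = np.random.default_rng(0)
s = rng.standard_normal(8000)             # 1 s of signal
tssp = cepstral_sequences(s, frame_shift=80, win_len=240, num_ceps=12)
print(tssp.shape)                         # (98, 12)
```

Each column of `tssp` is one time sequence indexed by the frame number n, which is the object that the filters discussed in this paper operate on.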
There are certain inherent limitations in this type of speech signal representation. First, spectral estimation based on finite data involves a certain random estimation error. Moreover, in speech spectral estimation, the relative positioning of each frame with respect to pitch periods introduces an additional estimation error. There is a tradeoff between the variance or the power of these errors and the time resolution of the spectral estimator that is mainly controlled by the window length L0 (Nadeu and Juang, 1994). A similar tradeoff exists between estimation error and frequency resolution. For a given L0, the number of spectral parameters Q determines that tradeoff for each estimator.
As the object of the present work is the time evolution of the speech spectral representations for speech recognition, we are interested in the shortcomings of the estimators concerning time resolution. The temporal resolution should be high enough to allow an accurate tracking of the time-varying filter characteristics, but this becomes difficult in fast transitions of speech signals. Additionally, the rigid frame-to-frame working mode does not make it easy to model the inherent dynamics of the speech signal. This problem is compounded by the independence of observations assumed in the usual hidden Markov models since it implies that each set of parameters is uncorrelated with those of the surrounding frames, except through the Markov chain.
Apart from the error due to the above limitations of the spectral estimation process, every TSSP carries more information from the speech signal than its mere phonetic content, such as speaker characteristics, acoustic distortion and noise. As these factors are sources of recognition errors, a suitable speech signal model should be robust to them.
In recent years, filtering of the TSSP has been extensively employed as a simple way of attempting to partially overcome these temporal limitations (see Hanson et al. (1996) for a survey of filtering techniques). Both dynamic features (Furui, 1986) and RASTA-type processing (Hermansky and Morgan, 1994; Hirsch et al., 1991) use linear filters to obtain more robust and more discriminative speech representations. Thus, the usual signal modelling process preceding pattern matching is as shown in Fig. 1.
Filtering is a convenient way to remove from the logarithmic spectral parameters the slowly varying linear distortion (due to the microphone, telephone channel, etc.) that is present in the speech signal. This is easily understood from a frequency analysis point of view. However, despite the widespread use of filtered parameters as dynamic features, few attempts have been made to use frequency analysis to gain insight into their characteristics. Explanations of the excellent performance of these supplementary filtered parameters are usually based on the idea of successive smoothed derivatives that capture the temporal change of the spectral parameters.
In this paper, we will try to obtain a better understanding of parameter filtering by resorting to frequency analysis and linear filter theory, and by making use of the long-term spectrum of the TSSP. The frequency variable of the spectrum, which is the Fourier counterpart of the frame index n, has been called modulation frequency in a subband analysis framework (Houtgast and Steeneken, 1985), since it corresponds to the envelope variation rate, and also in a general sense, to describe the rate of change of any spectral parameter representation (Hanson et al., 1996). If the frame shift equals 10 ms, the frame rate is 100 Hz, so there are spectral components at frequencies up to 50 Hz, half of the analysis frame rate. In principle, the modulation frequency could actually play a meaningful role in speech recognition, since statistical measures defined on it have been associated with speech intelligibility in several human auditory perception studies. We will return to this point in Section 7.1.
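The relation between frame shift and modulation frequency range can be checked numerically. The sketch below is an assumption-laden illustration: `modulation_spectrum` is a hypothetical helper that computes the periodogram of a single time sequence, whereas the long-term spectrum discussed in the paper would be an average over many utterances. The 4 Hz test sinusoid is an arbitrary choice.

```python
import numpy as np

def modulation_spectrum(c, frame_shift_s):
    """Periodogram of one parameter time sequence c(n) along the frame index n.

    Returns the modulation frequencies in Hz and the power at each of them.
    """
    freqs = np.fft.rfftfreq(len(c), d=frame_shift_s)
    power = np.abs(np.fft.rfft(c)) ** 2 / len(c)
    return freqs, power

frame_shift_s = 0.010                      # 10 ms shift, i.e. 100 frames/s
n = np.arange(200)                         # 2 s of frames
c = np.cos(2 * np.pi * 4.0 * n * frame_shift_s)   # 4 Hz modulation (arbitrary)

freqs, power = modulation_spectrum(c, frame_shift_s)
print(freqs[-1])                           # highest modulation frequency, in Hz
print(freqs[np.argmax(power)])             # location of the spectral peak, in Hz
```

With a 10 ms shift the frequency axis ends at half the frame rate, and the test sequence produces its peak at the modulation frequency used to generate it.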
The present work started from an initial observation: whereas the passbands of the frequency responses of the various filters employed so far for filtering the TSSP in similar recognition tasks are quite diverse, the high-power bands of the TSSP spectra of the filtered sequences show a noticeable similarity. In other words, the long-term spectrum of the TSSP decays along the modulation frequency and the various filters have in common a rising slope which equalizes that decaying spectral curve in a certain band. Such an observation led to a series of discussions and recognition experiments whose results are reported in this paper.
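The equalizing effect of a rising filter slope on a decaying spectrum can be illustrated with a toy model. Everything numeric here is an assumption made for illustration only: the decaying TSSP spectrum is modelled by a first-order all-pole response with a pole at 0.97, and the filter is a plain first-order difference with a zero at z=1. The filters actually used in practice also include smoothing, so they equalize only a limited band.

```python
import numpy as np

theta = np.linspace(0.0, np.pi, 512)       # normalized modulation frequency
z_inv = np.exp(-1j * theta)

pole = 0.97                                # assumed pole of the toy TSSP model
tssp_spec = 1.0 / np.abs(1.0 - pole * z_inv) ** 2   # decaying spectrum
diff_resp = np.abs(1.0 - z_inv) ** 2       # rising response, zero at theta = 0
equalized = tssp_spec * diff_resp          # spectrum of the filtered sequence

def db_range(x):
    """Ratio between the largest and smallest value, in dB."""
    return 10.0 * np.log10(x.max() / x.min())

band = theta >= 0.1 * np.pi                # look away from the dc region
print(db_range(tssp_spec[band]))           # spread before filtering, in dB
print(db_range(equalized[band]))           # spread after filtering, in dB
print(equalized[0])                        # dc component is removed
```

In this toy setting the spread of the spectrum over the selected band shrinks from well over 10 dB to a small fraction of a dB, while the zero-frequency component is cancelled.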
This paper is organized as follows. The long-term spectrum of the TSSP is presented in Section 2. The spectral effects of the various types of filters reported in the literature are explained in Section 3 and they are related to the HMM formalism in Section 4. After proposing in Section 5 a new filtering scheme that is based on the above observations, some recognition experiments are reported in Section 6 in order to validate the spectral approach. After they are discussed in Section 7, some questions arise which are tackled in the next sections. A certain reduction of speaker variability performed by the filter is shown in Section 8. The conventional cepstral mean subtraction technique is interpreted in terms of filtering in Section 9, and this permits us to discuss the role of the filter length and the dependence on the speaking rate. Finally, in Section 10, filtering is applied to short (subword) units in order to make apparent the effect of filtering on the unit boundaries with and without context modelling.
Section snippets
Spectrum of the TSSP
Let logS(ω,n) be the short-time log spectral estimate of the speech signal, with n denoting the frame index and ω the frequency. We shall use the cepstrum cm(n) as the representation of logS(ω,n), i.e.,
$$\log S(\omega,n)=\sum_{m=-\infty}^{\infty} c_m(n)\,e^{-j\omega m},$$
due to its widespread use in speech recognition applications. Note that the Fourier transform of the time sequence of the mth cepstral coefficient cm(n) is
$$\sum_{n=-\infty}^{\infty} c_m(n)\,e^{-j\theta n},$$
where the modulation frequency variable θ is the Fourier counterpart of the frame index n.
Let us
Filtering of the TSSP
Dynamic features of speech in the form of differential parameters are extensively employed in speech and speaker recognition systems. The differential parameters are usually analyzed in the time domain, as successive derivatives that capture the change of the TSSP (Furui, 1986) (Taylor's expansion). However, they can also be envisioned as the output of a linear filter driven by the TSSP. In this sense, these parameters can be referred to as (time) filtered parameters.
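The view of differential parameters as filter outputs can be made concrete with the familiar regression (delta) formula, d(n) = Σ k·c(n+k) / Σ k² for k = −K,…,K. The sketch below is an illustration under assumptions: the helper names `delta_filter` and `delta_features` and the choice K=2 are hypothetical, and the regression formula is used as a common instance of a differential parameter rather than as the specific filter of any cited system.

```python
import numpy as np

def delta_filter(K):
    """Impulse response of the regression filter, ordered for np.convolve.

    Convolution reverses the kernel, so the taps run from +K down to -K.
    """
    k = np.arange(-K, K + 1)
    return -k / np.sum(k ** 2)

def delta_features(c, K=2):
    """Differential parameters of the sequence c(n), computed as an FIR filter."""
    return np.convolve(c, delta_filter(K), mode="same")

# A ramp with slope 3 per frame: away from the edges the output equals the slope.
c = 3.0 * np.arange(20)
d = delta_features(c, K=2)
print(d[2:-2])

# The taps sum to zero, so the transfer function has a zero at z = 1 (dc).
print(delta_filter(2).sum())
```

The first and last K output samples are affected by the implicit zero padding of the convolution, which is why only the interior samples are inspected.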
Probably the most common
Band equalization and hidden Markov models
We have pointed out that most filters used so far have a zero at z=1 that performs an approximate equalization of the TSSP spectrum. Since the filter also shapes the spectrum of the equalized TSSP by enhancing a band which depends on the purpose of the corresponding filtered feature, either to complement the basic spectral feature or to substitute for it, we will refer to this effect as band equalization. It is illustrated in Fig. 6. Thus, all the modulation frequencies belonging to the
Filter design
The distribution of the filtered parameter spectral bands along a frequency interval 0⩽θ⩽θc⩽θs that leads to the best recognition results may depend on several factors: the number of supplementary features, the type of recognition task (e.g., IWR or CSR), the size of speech units, the speaking rate, the noise characteristics, etc. Consequently, the structure of the filters that compute either supplementary or substitutive parameters should be flexible enough to allow adaptation to these factors.
Experimental results
In order to try to validate the meaningfulness of the spectrum of the filtered TSSP along with the usefulness of the alternative filter structure of Fig. 7(b) and the Slepian filters, we applied the above design method to two speaker-independent word recognition tasks. Tests were conducted using (1) only a filtered set of parameters (one feature), and (2) the unfiltered set and two supplementary filtered sets (three features).
Discussion
In the previous section, we observed the effects of filtering the TSSP for CDHMM digit recognition with cepstral parameters. As expected, the recognition rate improved both when differential parameters were added and when the dc component, which is distorted by the telephone channel, was removed.
The above tests with differential parameters, which are obtained by FIR filters that consist of a cascade of a first-order equalizer and a Slepian filter, have made more apparent
Reducing the speaker variability
It is well known that the long-term spectrum of speech signals is influenced by the speaker's characteristics. Since long-term spectral characteristics are time-independent or slowly varying, they appear in the low frequency region of the TSSP spectrum T(θ). To verify this, we have carried out a few variance measurements for the TI digit database used in this work. For this purpose, we have used all the single digit utterances of the adult portion of the database. Every utterance
Cepstral mean subtraction and speaking rate
Cepstral mean subtraction (CMS) is a widely used technique to cancel linear distortion in speech recognition. It eliminates the zero frequency component of every time sequence of cepstral coefficients by subtracting from each of its (frame) samples the average value in the utterance. So the whole utterance has to be available before performing CMS. A recognition rate increase by using CMS has rarely been reported in the case of clean speech. See (Haeb-Umbach et al., 1993) for a clear
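The CMS operation described in this section can be sketched in a few lines. This is a minimal illustration under assumptions: the function name `cepstral_mean_subtraction`, the synthetic data and the constant offset standing in for a fixed linear channel (which is additive in the cepstral domain) are all hypothetical.

```python
import numpy as np

def cepstral_mean_subtraction(tssp):
    """Subtract the utterance average from every cepstral time sequence.

    tssp has shape (num_frames, num_ceps); the whole utterance is needed.
    """
    return tssp - tssp.mean(axis=0, keepdims=True)

rng = np.random.default_rng(1)
clean = rng.standard_normal((50, 12))      # synthetic cepstra, 50 frames
channel = rng.standard_normal(12)          # assumed constant cepstral offset
distorted = clean + channel                # same offset added to every frame

out_clean = cepstral_mean_subtraction(clean)
out_distorted = cepstral_mean_subtraction(distorted)

# The constant offset is cancelled, so both outputs coincide.
print(np.allclose(out_clean, out_distorted))

# The zero-frequency bin of every filtered sequence is (numerically) zero.
dc_bins = np.fft.rfft(out_distorted, axis=0)[0]
print(np.max(np.abs(dc_bins)))
```

Seen this way, CMS acts along the frame index as a filter that removes only the zero modulation frequency, with a length equal to that of the utterance.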
Continuous subword-unit-based speech recognition
Simple IIR or FIR time filters, which significantly improve performance in isolated or connected word recognition tasks, induce spectral transition spreading and a cross-boundary effect, which is critical in continuous speech recognition, where phoneme-sized modelling units are used and filters may worsen recognition results (Hermansky and Morgan, 1994). In this section, we show how the use of context-dependent units reduces the side effects of the filters and may result in improved recognition
Conclusion
In this paper, we have attempted to obtain a better understanding of parameter filtering by resorting to frequency analysis. The analysis of the average long-term spectrum of filtered TSSPs revealed a band equalization effect that emphasizes certain modulation frequency bands. Experimental results showed how the use of properly filtered parameter sequences, with no supplementary parameters, results in improved recognition rates even for clean speech, both using whole-word and subword based
Acknowledgements
The authors wish to thank J.B. Mariño, J. Hernando, E. Lleida, R.C. Rose, F.K. Soong, C.H. Lee and M. Rahim for their valuable suggestions and stimulating discussions. They would also like to express their gratitude to Manuel Toril for his assistance in the experimental work with the TI digit database. The work has been partly funded by the Spanish Government projects TIC95-0884-C04-02 and TIC95-1022-C05-03.
References (31)
- et al. Feature extraction using a matrix coefficient filter for speech recognition. Speech Communication (1993)
- et al. Improved acoustic modelling for large vocabulary CSR. Computer Speech and Language (1992)
- et al. Connected digit recognition based on improved acoustic resolution. Computer Speech and Language (1993)
- Applebaum, T.H., Hanson, B., 1990. Robust speaker-independent word recognition using spectral smoothing and temporal...
- Arai, T., Pavel, M., Hermansky, H., Avendaño, C., 1996. Intelligibility of speech with filtered time trajectories of...
- Avendaño, C., van Vuuren, S., Hermansky, H., 1996. Data based filter design for RASTA-like channel normalization in...
- Bonafonte, A., Estany, R., Vives, E., 1995. Study of subword units for Spanish speech recognition. Proc. Eurospeech'95,...
- et al. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer. (1994)
- et al. Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Amer. (1994)
- Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Signal Process. (1986)
- RASTA processing of speech. IEEE Trans. Speech Audio Process.
- 1
This paper is based on a communication presented at the ESCA Conference EUROSPEECH'95 and has been recommended by the EUROSPEECH'95 Scientific Committee.
- 2
The first part of this work was carried out while the first author was in the former AT&T Bell Laboratories for his sabbatical leave.