Speech Communication

Volume 22, Issue 4, September 1997, Pages 315-332
Filtering the time sequences of spectral parameters for speech recognition

https://doi.org/10.1016/S0167-6393(97)00030-7

Abstract

In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the frame-to-frame temporal evolution of the spectral envelope. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, using both whole-word and subword-based models. The empirically optimum filters attenuate the low-frequency band and emphasize a higher band so that the peak of the average long-term spectrum of the output of these filters lies at around the average syllable rate of the employed database (≈3 Hz).


Introduction

The first step in the pattern matching approach to the problem of speech recognition is to convert a speech waveform into a sequence of features, usually in the form of spectral parameters (Rabiner and Juang, 1993). Speech signals are usually modeled as the output of a time-varying filter driven by a signal whose spectrum is essentially either flat or a train of spectral lines of equal power. Consequently, on a short-time basis, the envelope of the speech spectrum represents the instantaneous spectral response of the filter whose characteristics are the determining factor of the identity of a speech sound or a speech utterance. Conventionally, speech spectral envelopes are represented by means of all-pole models or various forms of periodogram-based estimators, and often are expressed in terms of the corresponding cepstral coefficients (Picone, 1993).

These representations are calculated via short-time spectral analysis. Let the sampled speech signal be s(l). A window function w(l) is applied to it at regular intervals nN0, n=…,−1,0,1,2,…, to form frames of windowed signal s(l)w(nN0−l). The window function is usually of finite duration L0. Spectral analysis techniques are then used to obtain a short-time spectral estimate for each signal frame, which is represented with Q parameters (Q may be the order of the all-pole model, or the number of frequency bands of the periodogram-based estimators). In the general case, the set of parameters of each frame is transformed into a new representation (e.g., cepstral coefficients) that is better adapted to the speech classifier, which will use the spectral information to decide which speech unit or utterance has been said. Thus, this signal modelling process results in a set of time sequences of spectral parameters that represent the temporal evolution of the spectral response of the time-varying filter. We shall refer to each time sequence of spectral parameters as a TSSP, and we will hereafter assume that the spectral parameters are the common cepstral coefficients, although most of the derivations and results would also be valid for parameters in the logarithmic spectral domain or any linear transformation of them.
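As a concrete illustration (not the paper's exact front end), the framing and cepstral analysis described above can be sketched in a few lines of Python; the Hamming window, the frame shift N0 = 80 samples, the length L0 = 200 and the order Q = 12 are arbitrary choices made for the example:

```python
import numpy as np

def tssp(signal, frame_shift=80, frame_len=200, Q=12):
    """Frame the signal, window it, and compute Q cepstral
    coefficients per frame.  Returns a (Q, n_frames) array whose
    rows are the time sequences of spectral parameters (TSSPs)."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    ceps = np.empty((Q, n_frames))
    for n in range(n_frames):
        frame = signal[n * frame_shift:n * frame_shift + frame_len] * w
        # Periodogram-based log spectral estimate (small floor avoids log 0)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-10)
        # Real cepstrum: inverse Fourier transform of the log spectrum
        c = np.fft.irfft(log_spec)
        ceps[:, n] = c[1:Q + 1]   # drop c0 (overall gain/energy)
    return ceps
```

Each row `ceps[m, :]` is then one TSSP c_m(n), the object that the rest of the paper filters and analyzes.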

There are certain inherent limitations in this type of speech signal representation. First, spectral estimation based on finite data involves a certain random estimation error. Moreover, in speech spectral estimation, the relative positioning of each frame with respect to pitch periods introduces an additional estimation error. There is a tradeoff between the variance or the power of these errors and the time resolution of the spectral estimator that is mainly controlled by the window length L0 (Nadeu and Juang, 1994). A similar tradeoff exists between estimation error and frequency resolution. For a given L0, the number of spectral parameters Q determines that tradeoff for each estimator.

As the object of the present work is the time evolution of the speech spectral representations for speech recognition, we are interested in the shortcomings of the estimators concerning time resolution. The temporal resolution should be high enough to allow an accurate tracking of the time-varying filter characteristics, but this becomes difficult in fast transitions of speech signals. Additionally, the rigid frame-to-frame working mode does not make it easy to model the inherent dynamics of the speech signal. This problem is compounded by the independence of observations assumed in the usual hidden Markov models since it implies that each set of parameters is uncorrelated with those of the surrounding frames, except through the Markov chain.

Apart from the error due to the above limitations of the spectral estimation process, every TSSP carries more information from the speech signal than its mere phonetic content, such as speaker characteristics, acoustic distortion and noise. As these factors are sources of recognition errors, a suitable speech signal model should be robust to them.

In recent years, filtering of the TSSP has been extensively employed as a simple way of attempting to partially overcome these temporal limitations (see Hanson et al. (1996) for a survey of filtering techniques). Both dynamic features (Furui, 1986) and RASTA-type processing (Hermansky and Morgan, 1994; Hirsch et al., 1991) use linear filters to obtain more robust and more discriminative speech representations. Thus, the usual signal modelling process preceding pattern matching is as shown in Fig. 1.

Filtering is convenient to remove, from the logarithmic spectral parameters, the slowly varying linear distortion due to the microphone, telephone channel, etc. that is present in the speech signal. This fact is easily understandable from a frequency analysis point of view. However, despite their generalized usage, few attempts have been made to use frequency analysis to gain insight into the characteristics of the filtered parameters that are employed as dynamic features. Explanations of the excellent performance of these supplementary filtered parameters are usually based on the idea of successive smoothed derivatives that capture the temporal change of the spectral parameters.

In this paper, we will try to obtain a better understanding of parameter filtering by resorting to frequency analysis and linear filter theory, and by making use of the long-term spectrum of the TSSP. The frequency variable of the spectrum, which is the Fourier counterpart of the frame index n, has been called modulation frequency in a subband analysis framework (Houtgast and Steeneken, 1985), since it corresponds to the envelope variation rate, and also in a general sense, to describe the rate of change of any spectral parameter representation (Hanson et al., 1996). If the frame shift equals 10 ms, there are spectral components at frequencies up to 50 Hz, half of the analysis frame rate. In principle, the modulation frequency could actually play a meaningful role in speech recognition, since statistical measures defined on it have been associated with speech intelligibility in several human auditory perception studies. We will return to this point in Section 7.1.

The present work started from an initial observation: whereas the passbands of the frequency responses of the various filters employed so far for filtering the TSSP in similar recognition tasks are quite diverse, the high-power bands of the TSSP spectra of the filtered sequences show a noticeable similarity. In other words, the long-term spectrum of the TSSP decays along the modulation frequency and the various filters have in common a rising slope which equalizes that decaying spectral curve in a certain band. Such an observation led to a series of discussions and recognition experiments whose results are reported in this paper.

This paper is organized as follows. The long-term spectrum of the TSSP is presented in Section 2. The spectral effects of the various types of filters reported in the literature are explained in Section 3, and they are related to the HMM formalism in Section 4. After proposing in Section 5 a new filtering scheme that is based on the above observations, some recognition experiments are reported in Section 6 in order to validate the spectral approach. After they are discussed in Section 7, some questions arise which are tackled in the next sections. A certain reduction of speaker variability performed by the filter is shown in Section 8. The conventional cepstral mean subtraction technique is interpreted in terms of filtering in Section 9, and this permits us to discuss the role of the filter length and the dependence on the speaking rate. Finally, in Section 10, filtering is applied to short (subword) units in order to make apparent the effect of filtering on the unit boundaries with and without context modelling.


Spectrum of the TSSP

Let log S(ω,n) be the short-time log spectral estimate of the speech signal, with n denoting the frame index and ω the frequency. We shall use the cepstrum c_m(n) as the representation of log S(ω,n), i.e.,
$$c_m(n)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\log S(\omega,n)\,e^{j\omega m}\,d\omega,$$
due to its widespread use in speech recognition applications. Note that the Fourier transform of the time sequence of the mth cepstral coefficient c_m(n) is
$$C_m(\theta)=\sum_{n}c_m(n)\,e^{-j\theta n},$$
where the modulation frequency variable θ is the Fourier counterpart of the frame index n.
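As an illustration, C_m(θ) and an averaged long-term spectrum can be approximated with the FFT; the periodogram-averaging estimator and the helper name `long_term_spectrum` below are assumptions made for this sketch, not the paper's exact procedure:

```python
import numpy as np

def long_term_spectrum(trajectories, nfft=256):
    """Rough estimate of the average long-term modulation spectrum
    of one cepstral trajectory c_m(n): the mean periodogram
    |C_m(theta)|^2 over a collection of sequences (e.g. utterances)."""
    acc = np.zeros(nfft // 2 + 1)
    for c in trajectories:
        acc += np.abs(np.fft.rfft(c, nfft)) ** 2 / len(c)
    return acc / len(trajectories)

# With a 10 ms frame shift the frame rate is 100 Hz, so the
# modulation-frequency axis runs from 0 up to the 50 Hz Nyquist limit:
mod_freqs = np.fft.rfftfreq(256, d=0.01)   # d = frame shift in seconds
```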

Let us

Filtering of the TSSP

Dynamic features of speech in the form of differential parameters are extensively employed in speech and speaker recognition systems. The differential parameters are usually analyzed in the time domain, as successive derivatives that capture the change of the TSSP (Furui, 1986) (Taylor's expansion). However, they can also be envisioned as the output of a linear filter driven by the TSSP. In this sense, these parameters can be referred to as (time) filtered parameters.
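For instance, the standard regression delta can be written explicitly as an FIR filter with an anti-symmetric impulse response; the window half-length K = 2 and the edge-replication padding below are choices made for this sketch rather than prescriptions from the paper:

```python
import numpy as np

def delta(c, K=2):
    """Differential parameters as the output of an FIR filter: the
    regression delta with impulse response h(k) = -k / sum(k^2),
    k = -K..K.  The response is anti-symmetric, so the filter has a
    zero at z = 1 and removes the dc component of the TSSP."""
    k = np.arange(-K, K + 1)
    h = -k / np.sum(k ** 2)            # FIR impulse response
    c_pad = np.pad(c, K, mode='edge')  # replicate endpoints
    return np.convolve(c_pad, h, mode='valid')
```

On a linear ramp c(n) = n the interior output is the constant slope 1, and on a constant sequence it is zero, as expected of a smoothed derivative.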

Probably the most common

Band equalization and hidden Markov models

We have pointed out that most filters used so far have a zero at z=1 that performs an approximate equalization of the TSSP spectrum. Since the filter also shapes the spectrum of the equalized TSSP by enhancing a band which depends on the purpose of the corresponding filtered feature, either to complement the basic spectral feature or to substitute for it, we will refer to this effect as band equalization. It is illustrated in Fig. 6. Thus, all the modulation frequencies belonging to the
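A minimal sketch of the band-equalization idea, assuming a generic cascade (a first-order equalizer with its zero at z = 1, followed by a short smoothing filter) rather than any specific filter from the paper:

```python
import numpy as np
from scipy.signal import freqz

# The zero at z = 1 cancels dc and lifts the decaying TSSP spectrum,
# while the smoothing filter bounds the band of modulation
# frequencies that ends up enhanced.
equalizer = np.array([1.0, -1.0])      # H(z) = 1 - z^-1
smoother = np.hanning(9)
smoother /= smoother.sum()
h = np.convolve(equalizer, smoother)   # overall FIR response

# Frequency response on a modulation-frequency axis (100 Hz frame rate)
w, H = freqz(h, worN=1024, fs=100.0)
peak_hz = w[np.argmax(np.abs(H))]      # centre of the enhanced band
```

The response is zero at dc and peaks at a low modulation frequency, mimicking the combined equalization-plus-band-selection effect described above.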

Filter design

The distribution of the filtered parameter spectral bands along a frequency interval $0\leq\theta\leq\theta_c<\theta_s$ that leads to the best recognition results may depend on several factors: the number of supplementary features, the type of recognition task (e.g., IWR or CSR), the size of speech units, the speaking rate, the noise characteristics, etc. Consequently, the structure of the filters that compute either supplementary or substitutive parameters should be flexible enough to allow adaptation to these factors.
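One way such a flexible structure could be realized, assuming SciPy's DPSS (Slepian) windows and a cosine-modulated band-pass prototype; these are illustrative choices, not the paper's exact design method:

```python
import numpy as np
from scipy.signal.windows import dpss

def slepian_bandpass(length, center_hz, half_bw_hz, frame_rate=100.0):
    """Sketch of a Slepian-based band-pass FIR filter for TSSPs:
    the DPSS prototype is the finite sequence whose energy is most
    concentrated in a lowpass band of half-width half_bw_hz, and
    cosine modulation shifts that band to center_hz."""
    NW = length * half_bw_hz / frame_rate   # time-half-bandwidth product
    proto = dpss(length, NW)                # lowpass Slepian prototype
    n = np.arange(length) - (length - 1) / 2
    return proto * np.cos(2 * np.pi * center_hz * n / frame_rate)

# Cascade with the first-order equalizer (zero at z = 1):
h = np.convolve([1.0, -1.0], slepian_bandpass(21, 4.0, 5.0))
```

Changing the length, centre and bandwidth adapts the enhanced modulation band to the factors listed above, while the equalizer stage keeps the dc component cancelled.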

Experimental results

In order to validate the meaningfulness of the spectrum of the filtered TSSP, along with the usefulness of the alternative filter structure of Fig. 7(b) and the Slepian filters, we applied the above design method to two speaker-independent word recognition tasks. Tests were conducted using (1) only a filtered set of parameters (one feature), and (2) the unfiltered set and two supplementary filtered sets (three features).

Discussion

In the previous section, we have observed the effects of filtering the TSSP for CDHMM digit recognition when cepstral parameters are used. As was expected, an improvement in recognition rate was observed both by adding differential parameters and by removing the dc component, which is distorted by the telephone channel.

The above tests with differential parameters, which are obtained by FIR filters that consist of a cascade of a first-order equalizer and a Slepian filter, have made more apparent

Reducing the speaker variability

It is well known that the long-term spectrum of speech signals is influenced by the speaker's characteristics. Since long-term spectral characteristics are time-independent or slowly varying, they appear in the low-frequency region of the TSSP spectrum T(θ). So as to verify this, we have carried out a few variance measurements for the TI digit database used in this work. For this purpose, we have used all the single digit utterances of the adult portion of the database. Every utterance

Cepstral mean subtraction and speaking rate

Cepstral mean subtraction (CMS) is a widely used technique to cancel linear distortion in speech recognition. It eliminates the zero frequency component of every time sequence of cepstral coefficients by subtracting from each of its (frame) samples the average value in the utterance. So the whole utterance has to be available before performing CMS. A recognition rate increase by using CMS has rarely been reported in the case of clean speech. See (Haeb-Umbach et al., 1993) for a clear
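In matrix form, CMS is a one-line operation; this sketch assumes the TSSPs are stored as a (Q, n_frames) array:

```python
import numpy as np

def cms(ceps):
    """Cepstral mean subtraction: subtract from each cepstral
    trajectory its average over the utterance, cancelling the
    zero-modulation-frequency (dc) component of every TSSP.
    Note that the whole utterance must be available before the
    per-trajectory means can be computed."""
    return ceps - ceps.mean(axis=1, keepdims=True)
```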

Continuous subword-unit-based speech recognition

Simple IIR or FIR time filters, which significantly improve performance in isolated or connected word recognition tasks, induce spectral transition spreading and a cross-boundary effect. This effect is critical in continuous speech recognition, where phoneme-sized modelling units are used and filters may worsen recognition results (Hermansky and Morgan, 1994). In this section, we show how the use of context-dependent units reduces the side effects of the filters and may result in improved recognition

Conclusion

In this paper, we have attempted to obtain a better understanding of parameter filtering by resorting to frequency analysis. The analysis of the average long-term spectrum of filtered TSSPs revealed a band equalization effect that emphasizes certain modulation frequency bands. Experimental results showed how the use of properly filtered parameter sequences, with no supplementary parameters, results in improved recognition rates even for clean speech, both using whole-word and subword based

Acknowledgements

The authors wish to thank J.B. Mariño, J. Hernando, E. Lleida, R.C. Rose, F.K. Soong, C.-H. Lee and M. Rahim for their valuable suggestions and stimulating discussions. They would also like to express their gratitude to Manuel Toril for his assistance in the experimental work with the TI digit database. The work has been partly funded by the Spanish Government projects TIC95-0884-C04-02 and TIC95-1022-C05-03.

References (31)

  • Katagishi, K., et al., 1993. Feature extraction using a matrix coefficient filter for speech recognition. Speech Communication.
  • Lee, C.-H., et al., 1992. Improved acoustic modelling for large vocabulary CSR. Computer Speech and Language.
  • Wilpon, J., et al., 1993. Connected digit recognition based on improved acoustic resolution. Computer Speech and Language.
  • Applebaum, T.H., Hanson, B., 1990. Robust speaker-independent word recognition using spectral smoothing and temporal...
  • Arai, T., Pavel, M., Hermansky, H., Avendaño, C., 1996. Intelligibility of speech with filtered time trajectories of...
  • Avendaño, C., van Vuuren, S., Hermansky, H., 1996. Data based filter design for RASTA-like channel normalization in...
  • Bonafonte, A., Estany, R., Vives, E., 1995. Study of subword units for spanish speech recognition. Proc. Eurospeech'95,...
  • Drullman, R., et al., 1994. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer.
  • Drullman, R., et al., 1994. Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Amer.
  • Furui, S., 1986. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Signal Process.
  • Greenberg, S., Kingsbury, B., 1997. The modulation spectrogram: in pursuit of an invariant representation of speech....
  • Haeb-Umbach, R., Geller, D., Ney, H., 1993. Improvements in connected digit recognition using linear discriminant...
  • Hanson, B.A., Applebaum, T.H., Junqua, J.C., 1996. Spectral dynamics for speech recognition under adverse conditions....
  • Hermansky, H., et al., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.
  • Hermansky, H., Avendaño, C., van Vuuren, S., Tibrewala, S., 1997. Recent advances in addressing sources of...
1. This paper is based on a communication presented at the ESCA Conference EUROSPEECH'95 and has been recommended by the EUROSPEECH'95 Scientific Committee.

2. The first part of this work was carried out while the first author was at the former AT&T Bell Laboratories on sabbatical leave.
