Speech Communication

Volume 39, Issues 3–4, February 2003, Pages 269–286

On the efficiency of classical RASTA filtering for continuous speech recognition: Keeping the balance between acoustic pre-processing and acoustic modelling

https://doi.org/10.1016/S0167-6393(02)00030-4

Abstract

The efficiency of classical RASTA filtering for channel normalisation was investigated for continuous speech recognition based on context-independent and context-dependent hidden Markov models. For a medium and a large vocabulary continuous speech recognition task, recognition performance was established for classical RASTA filtering and compared to using no channel normalisation, cepstrum mean normalisation, and phase-corrected RASTA. Phase-corrected RASTA is a technique that consists of classical RASTA filtering followed by a phase correction operation. In this manner, channel bias is as effectively removed as with classical RASTA. However, for phase-corrected RASTA, amplitude drift towards zero in stationary signal portions is diminished compared to classical RASTA. The results show that application of classical RASTA filtering resulted in decreased recognition performance when compared to using no channel normalisation for all conditions studied, although the decrease appeared to be smaller for context-dependent models than for context-independent models. However, for all conditions, recognition performance was significantly and substantially improved when phase-corrected RASTA was used and reached the same performance level as obtained for cepstrum mean normalisation in some cases. It is concluded that classical RASTA filtering can only be effective for channel robustness if the impact of the amplitude drift towards zero can be kept as limited as possible.

Introduction

The performance of automatic speech recognition (ASR) technology has been improving steadily, not in the least part thanks to improvements in acoustic feature extraction. Especially when the speech signal to be recognised is distorted, powerful pre-processing techniques are needed. In ASR over the telephone, one of the key factors introducing distortions is the communication channel.

The presence of a communication channel can, in principle, introduce four different types of distortions (e.g. Junqua and Haton, 1996). The channel may introduce:

  1. additive noise (e.g. telephone line clicks),
  2. linear filtering (e.g. the typical band-pass characteristic of 300–3400 Hz for fixed land-line telephony, and the linear frequency response of the handset microphone),
  3. non-linear filtering (as may be caused by using a carbon button microphone), and
  4. empty signal portions, which are typical for transmission problems in cellular telephony.


In this paper, we will restrict ourselves to the effects of linear filtering introduced by telephone channels. For this type of distortion it is the unpredictability of the linear transfer function that causes ASR performance to deteriorate.

Many different techniques have already been proposed to alleviate the effects of unpredictable linear filtering: cepstrum mean subtraction (Atal, 1974; Furui, 1981), using many different channels during training (e.g. Hirsch et al., 1991; Hermansky and Morgan, 1994; Aikawa et al., 1993; Nadeu et al., 1995; Junqua et al., 1995; de Veth and Boves, 1998), high-pass filtering of parameter tracks (Hirsch et al., 1991), the Gaussian dynamic cepstrum representation (Aikawa et al., 1993), RASTA filtering (Hermansky and Morgan, 1994), Slepian filtering (Nadeu et al., 1995), cepstral-time matrices (Milner and Vaseghi, 1995), phase-corrected RASTA filtering (de Veth and Boves, 1998), signal bias removal (Rahim and Juang, 1996), stochastic matching (Sankar and Lee, 1996), and combinations of these techniques (Junqua et al., 1995). An overview of many of these channel normalisation (CN) techniques can be found in (de Veth et al., 2001).

Given the rich diversity of CN methods, the question arises as to which technique is best suited for a particular ASR design and a particular task. One of the factors to be considered is the type of units that are used to represent the speech sounds. In previous research on connected digit recognition, we used whole word models in addition to phone models (de Veth and Boves, 1998). For continuous speech recognition (CSR), however, there is no viable alternative to some kind of sub-word model. Such models can take various forms; in our research, only continuous-density Gaussian-mixture hidden Markov models (HMMs) were used.

In this paper, we investigate the gain in recognition performance for four CN techniques in large vocabulary ASR systems that use either context-independent (CI) or context-dependent (CD) sub-word HMMs. The four CN techniques are (1) training on many different channels without additional processing (indicated as ‘no channel normalisation’ (NCN) in the remainder of the paper), (2) cepstral mean subtraction over the complete utterance (indicated as CMS), (3) classical RASTA filtering (indicated as clR), and (4) phase-corrected RASTA (de Veth and Boves, 1996, 1997a, 1997b, 1998) (indicated as pcR). The aim of this paper is to study the interaction between conventional RASTA filtering, proposed to improve channel robustness, on the one hand and the average number of different left contexts per modelling unit on the other. The results for pcR enable identification of an important factor that limits the efficiency of clR. The results for NCN are included to indicate a baseline recognition performance. The results for CMS are included to indicate what can ideally be achieved when a filtering approach to CN is chosen. Thus, the results for NCN and CMS serve as a background against which the results obtained with clR and pcR can be more readily interpreted.
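As an illustration of the second technique, the following sketch shows utterance-level cepstral mean subtraction in Python. It is a minimal sketch rather than the implementation used in the experiments: the feature dimensionality (12 MFCCs per frame) follows the front-end described later in the paper, while the array names and the example data are purely hypothetical.

    import numpy as np

    def cepstral_mean_subtraction(cepstra):
        # cepstra: array of shape (n_frames, n_coefficients), e.g. 12 MFCCs per frame.
        # Subtracting the long-term average of each coefficient removes a constant
        # (convolutional) channel bias, which appears as an additive offset in the
        # cepstral domain.
        return cepstra - cepstra.mean(axis=0, keepdims=True)

    # Hypothetical utterance of 300 frames with 12 cepstral coefficients per frame.
    utterance = np.random.randn(300, 12)
    normalised = cepstral_mean_subtraction(utterance)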

This paper is further organised as follows. In Section 2, the salient features of clR and pcR are discussed, and our research questions are precisely formulated. The telephone speech databases that were used in our experiments are described in Section 3. In Section 4, the signal processing for our experiments is described. The topology of the HMMs, the way these were trained and the recognition task are described in Section 5. In addition, a measure for the average number of left contexts per model is introduced in Section 5. The results of our recognition experiments are presented in Section 6 and discussed in Section 7. Finally, in Section 8, the main conclusions are summarised.


Filter properties

Classical RASTA filtering is an IIR filter operation. The infinite memory of the filter means that a filtered observation value at time t depends not only on the original observation value at time t, but also on all previous original observation values (i.e. at times t′ < t). As a result, the original shape of a sequence of observation values as a function of time is not preserved. This effect is known as the left-context dependency of clR (Koehler et al., 1994; Hermansky and Morgan, 1994).
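The exact filter coefficients used in the experiments are not repeated in this snippet, so the sketch below assumes the commonly quoted RASTA transfer function of Hermansky and Morgan (1994), with numerator 0.1(2 + z^-1 - z^-3 - 2z^-4) and a single pole; the pole value chosen here (0.94, with 0.98 also reported in the literature) and the frame data are assumptions.

    import numpy as np
    from scipy.signal import lfilter

    def classical_rasta(track, pole=0.94):
        # Causal form of the classical RASTA band-pass filter applied to one
        # cepstral-coefficient trajectory (one value per frame):
        #   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - pole*z^-1)
        # The single pole gives the filter infinite memory: every output frame
        # depends on all earlier input frames (the left-context dependency).
        b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
        a = np.array([1.0, -pole])
        return lfilter(b, a, track)

    # Filter each cepstral-coefficient track of a hypothetical utterance independently.
    cepstra = np.random.randn(300, 12)                   # (frames, coefficients)
    filtered = np.apply_along_axis(classical_rasta, 0, cepstra)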

Database DB1

The first database for the experiments was collected with an on-line version of a spoken dialog system that provides public transport information in the Netherlands. This system is an adaptation of a German prototype developed by Philips Research Labs (Steinbiss et al., 1995; Strik et al., 1997). Speakers were recorded over the public switched telephone network in the Netherlands. Speakers, handset and channel characteristics are unknown.

A total of 33,471 utterances was collected and

Signal processing

Speech signals were stored in A-law format. After conversion to a linear scale, preemphasis with a factor of 0.98 was applied. A 25 ms Hamming window, shifted in 10 ms steps, was used to calculate 24 filter-band energy values for each frame. The 24 triangular filters were uniformly distributed on a mel-frequency scale (covering 0–2143.6 mel). Finally, 12 mel-frequency cepstral coefficients (MFCCs) were derived. In addition to the signal processing performed to obtain twelve MFCCs, (this was
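The paragraph above fixes the pre-emphasis factor (0.98), the analysis window (25 ms Hamming, 10 ms shift), the number of mel filters (24, spanning 0–2143.6 mel) and the number of cepstral coefficients (12). The sketch below strings these steps together in Python as one possible reading of that description; the 8 kHz sampling rate, the 512-point FFT, and the decision to drop c0 are assumptions not stated in the snippet.

    import numpy as np
    from scipy.fftpack import dct

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_front_end(signal, fs=8000, n_filters=24, n_ceps=12,
                       win_ms=25, shift_ms=10, preemph=0.98, n_fft=512):
        # Pre-emphasis with factor 0.98.
        x = np.append(signal[0], signal[1:] - preemph * signal[:-1])

        # 25 ms Hamming window shifted in 10 ms steps.
        win_len, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
        n_frames = 1 + (len(x) - win_len) // shift
        window = np.hamming(win_len)
        frames = np.stack([x[i * shift:i * shift + win_len] * window
                           for i in range(n_frames)])

        # Power spectrum of each frame.
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

        # 24 triangular filters uniformly spaced on the mel scale (0 to 2143.6 mel).
        hz_points = mel_to_hz(np.linspace(0.0, 2143.6, n_filters + 2))
        bins = np.floor((n_fft + 1) * hz_points / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[m - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

        # Log filter-bank energies, then a DCT; keep c1..c12 (c0 dropped here).
        log_energies = np.log(np.maximum(power @ fbank.T, 1e-10))
        return dct(log_energies, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]

    # Example: one second of synthetic signal at the assumed 8 kHz sampling rate.
    mfccs = mfcc_front_end(np.random.randn(8000))   # -> (n_frames, 12)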

Definition of context-independent models

In all experiments, 33 CI phone models were defined. In addition, two allophones each of /l/ and /r/ were defined, for use in pre-vocalic and post-vocalic position. Finally, one model describing all sorts of noise as well as a model for silence were used. The 37 phone models and the noise model consisted of six HMM states; states 2, 4 and 6 shared the emission probability density function with states 1, 3 and 5, respectively. For the silence model a single-state HMM was used. All HMMs were
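The tying scheme just described (states 2, 4 and 6 reusing the emission densities of states 1, 3 and 5, and a single-state silence model) can be written down compactly. The sketch below is only an illustration of that bookkeeping with hypothetical names, not the configuration format actually used for training.

    # Emission tying for a six-state phone or noise HMM: state s emits with the
    # density of TIE[s], so only three distinct Gaussian-mixture densities are
    # stored per model.
    PHONE_TIE = {1: 1, 2: 1, 3: 3, 4: 3, 5: 5, 6: 5}

    # The silence model is a single-state HMM with its own density.
    SILENCE_TIE = {1: 1}

    def distinct_densities(tie):
        # Number of distinct emission pdfs implied by a tying map.
        return len(set(tie.values()))

    assert distinct_densities(PHONE_TIE) == 3
    assert distinct_densities(SILENCE_TIE) == 1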

CI-HMMs for DB1

We trained and tested CI-HMMs for four different conditions: NCN, clR, CMS and pcR. The WER results are shown in Fig. 4 as a function of the total number of Gaussians used. Note that the scale shown at the top indicates the number of Gaussians per state that was used for each HMM configuration. Fig. 4 shows that clR deteriorates recognition performance compared to NCN. Apparently, removing the channel bias with clR introduces amplitude drift to such a degree that the gain from CN is completely

Discussion

In this paper, we investigated recognition configurations differing in the average number of left contexts that were separately modelled. The degree of control over the average number of left contexts modelled per phoneme segment, R, was limited in these experiments. As can be seen in Eq. (6), parameter R depends on the total number of left contexts Cl, the total number of separately modelled left phone segments Mli, and the total number of clustered left phone segments Mlc. Parameter

Conclusions

In this paper, the efficiency of clR filtering for CN was investigated for CSR based on CI and CD-HMMs. For two different CSR tasks, recognition performance was established for clR filtering, and compared to using no CN, cepstrum mean subtraction and pcR. With pcR, the channel bias is as effectively removed as with clR, while the amplitude drift towards zero introduced by clR is less important (de Veth and Boves, 1998). The study was focussed on whether the differences between clR filtering and

Acknowledgements

This research was funded through the Priority Programme Language and Speech Technology (TST). The TST Programme is sponsored by NWO (Dutch Organization for Scientific Research). The authors would like to thank Mirjam Wester (A2RT) for creating the lexicon used for the Polyphone database and Carsten Meyer (Philips Research, Aachen) for helpful discussions about the CD models.

References (27)

  • Aikawa, K., Singer, H., Kawahara, H., Tohkura, Y., 1993. A dynamic cepstrum incorporating time-frequency masking and...
  • Atal, B., 1974. Automatic recognition of speakers from their voices. Proc. IEEE.
  • de Veth, J., Boves, L., 1996. Comparison of channel normalisation techniques for automatic speech recognition over the...
  • de Veth, J., Boves, L., 1997a. Channel normalisation using phase-corrected RASTA. In: Proc. ESCA-NATO Workshop on...
  • de Veth, J., Boves, L., 1997b. Phase-corrected RASTA for automatic speech recognition over the phone. In: Proc....
  • de Veth, J., et al., 1998. Channel normalization techniques for automatic speech recognition over the telephone. Speech Communication.
  • de Veth, J., Cranen, B., Boves, L., 2001. Acoustic features and distance measure to reduce the vulnerability of ASR...
  • den Os, E., Boogaart, T., Boves, L., Klabbers, E., 1995. The Dutch Polyphone corpus. Proc. Eurospeech, pp....
  • Drullman, R., et al., 1994. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer.
  • Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process.
  • Hermansky, H., 1997. Should recognizers have ears? Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown...
  • Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.
  • Hermansky, H., Pavel, M., 1995. Psychophysics of speech engineering systems. In: Proc. Internat. Conf. Phon. Sc., pp....
Expanded version of the paper presented at The 5th International Conference on Spoken Language Processing 1998, Sydney, Australia. Short-listed by the Scientific Committee of ICSLP-98 for publication as regular paper in Speech Communication.
