On the efficiency of classical RASTA filtering for continuous speech recognition: Keeping the balance between acoustic pre-processing and acoustic modelling☆
Introduction
The performance of automatic speech recognition (ASR) technology has been improving steadily, not least thanks to improvements in acoustic feature extraction. Powerful pre-processing techniques are needed especially when the speech signal to be recognised is distorted. In ASR over the telephone, one of the key sources of distortion is the communication channel.
The presence of a communication channel can, in principle, introduce four different types of distortions (e.g. Junqua and Haton, 1996). The channel may introduce:
1. additive noise (e.g. telephone line clicks),
2. linear filtering (e.g. the typical band-pass characteristic of 300–3400 Hz for fixed land-line telephony, and the linear frequency response of the handset microphone),
3. non-linear filtering (as may be caused by using a carbon button microphone), and
4. empty signal portions, which are typical for transmission problems in cellular telephony.
In this paper, we will restrict ourselves to the effects of linear filtering introduced by telephone channels. For this type of distortion it is the unpredictability of the linear transfer function that causes ASR performance to deteriorate.
Many techniques have been proposed to alleviate the effects of unpredictable linear filtering: cepstrum mean subtraction (Atal, 1974; Furui, 1981), using many different channels during training (e.g. Hirsch et al., 1991; Hermansky and Morgan, 1994; Aikawa et al., 1993; Nadeu et al., 1995; Junqua et al., 1995; de Veth and Boves, 1998), high-pass filtering of parameter tracks (Hirsch et al., 1991), the Gaussian dynamic cepstrum representation (Aikawa et al., 1993), RASTA filtering (Hermansky and Morgan, 1994), Slepian filtering (Nadeu et al., 1995), cepstral-time matrices (Milner and Vaseghi, 1995), phase-corrected RASTA filtering (de Veth and Boves, 1998), signal bias removal (Rahim and Juang, 1996), stochastic matching (Sankar and Lee, 1996), and combinations of these techniques (Junqua et al., 1995). An overview of many of these channel normalisation (CN) techniques can be found in de Veth et al. (2001).
Given the rich diversity of CN methods, the question arises which technique is best suited for a particular ASR design and a particular task. One of the factors to be considered is the type of unit used to represent the speech sounds. In previous research on connected digit recognition, we used whole word models in addition to phone models (de Veth and Boves, 1998). For continuous speech recognition (CSR), however, there is no viable alternative to some kind of sub-word model. Such models can take various forms, yet in our research only continuous-density Gaussian mixture hidden Markov models (HMMs) were used.
In this paper, we investigate the gain in recognition performance for four CN techniques in large vocabulary ASR systems that use either context-independent (CI) or context-dependent (CD) sub-word HMMs. The four CN techniques are (1) training on many different channels without additional processing (indicated as ‘no channel normalisation’ (NCN) in the remainder of the paper), (2) cepstral mean subtraction over the complete utterance (indicated as CMS), (3) classical RASTA filtering (indicated by clR), and (4) phase-corrected RASTA (de Veth and Boves, 1996, 1997a, 1997b, 1998) (indicated by pcR). The aim of this paper is to study the interaction between the conventional RASTA filtering technique proposed for improvement of channel robustness on the one hand, and the average number of different left contexts per modelling unit on the other. The results for pcR enable identification of an important factor that limits the efficiency of clR. The results for NCN are included to indicate a baseline recognition performance. The results for CMS are included to indicate what can ideally be achieved when a filtering approach to CN is chosen. The results for NCN and CMS thus provide a background against which the results obtained with clR and pcR can be more readily interpreted.
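As an illustration of technique (2), the following is a minimal sketch of cepstral mean subtraction over a complete utterance. It rests on the common assumption that a fixed linear channel appears as an approximately constant additive offset on each cepstral coefficient. The function name, array shapes and toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the per-coefficient mean computed over the whole utterance.

    cepstra has shape (num_frames, num_coefficients).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)


# Toy data (assumption): 200 frames of 12 cepstral coefficients.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 12))

# Hypothetical fixed channel, modelled as one constant offset per coefficient.
channel_offset = rng.normal(size=(1, 12))
observed = clean + channel_offset

normalised = cepstral_mean_subtraction(observed)
```

In this idealised setting the normalised features no longer depend on the channel offset, which is why CMS is a natural reference point for filtering approaches to CN.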
This paper is further organised as follows. In Section 2, the salient features of clR and pcR are discussed, and our research questions are precisely formulated. The telephone speech databases that were used in our experiments are described in Section 3. In Section 4, the signal processing for our experiments is described. The topology of the HMMs, the way these were trained and the recognition task are described in Section 5. In addition, a measure for the average number of left contexts per model is introduced in Section 5. The results of our recognition experiments are presented in Section 6 and discussed in Section 7. Finally, in Section 8, the main conclusions are summarised.
Filter properties
Classical RASTA filtering is an infinite impulse response (IIR) filter operation. Because of the infinite memory of the filter, a filtered observation value at time t depends not only on the original observation value at time t, but also on all previous original observation values (i.e. at times t′<t). As a result, the original shape of a sequence of observation values as a function of time is not preserved. This effect is known as the left context dependency of clR (Koehler et al., 1994; Hermansky and Morgan, 1994).
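The left context dependency can be made concrete with a small sketch. It assumes the commonly quoted RASTA transfer function, with numerator 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴) and a single pole at 0.98, applied here in causal form (without the four-frame time advance). The excerpt does not state the coefficients used in the experiments, so these values, the function name and the toy tracks are assumptions.

```python
import numpy as np

# Assumed coefficients (commonly quoted RASTA form, not taken from this excerpt).
RASTA_NUMERATOR = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
RASTA_POLE = 0.98


def classical_rasta(track, pole=RASTA_POLE):
    """Filter one parameter track (one value per frame) with a causal RASTA filter."""
    track = np.asarray(track, dtype=float)
    out = np.zeros_like(track)
    for t in range(len(track)):
        # FIR part: weighted sum over the current and the four previous frames.
        fir = sum(b * track[t - k] for k, b in enumerate(RASTA_NUMERATOR) if t - k >= 0)
        # IIR part: feedback of the previous output gives the filter infinite memory.
        out[t] = fir + (pole * out[t - 1] if t > 0 else 0.0)
    return out


# A constant offset (channel bias) is driven towards zero after the start-up transient.
constant_output = classical_rasta(np.full(1000, 3.0))

# The same 10-frame segment preceded by two different left contexts.
segment = np.linspace(0.0, 1.0, 10)
after_silence = classical_rasta(np.concatenate([np.zeros(50), segment]))
after_loud = classical_rasta(np.concatenate([np.full(50, 5.0), segment]))
```

Comparing the last ten values of `after_silence` and `after_loud` shows that the filtered shape of an identical segment differs with the preceding context, while `constant_output` decays towards zero.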
Database DB1
The first database for the experiments was collected with an on-line version of a spoken dialog system that provides public transport information in the Netherlands. This system is an adaptation of a German prototype developed by Philips Research Labs (Steinbiss et al., 1995; Strik et al., 1997). Speakers were recorded over the public switched telephone network in the Netherlands. Speakers, handset and channel characteristics are unknown.
A total of 33,471 utterances was collected and
Signal processing
Speech signals are in A-law format. After conversion to a linear scale, preemphasis with factor 0.98 was applied. A 25 ms Hamming window that was shifted with 10 ms steps was used to calculate 24 filter band energy values for each frame. The 24 triangular shaped filters were uniformly distributed on a mel-frequency scale (covering 0–2143.6 mel). Finally, 12 mel-frequency cepstral coefficients (MFCCs) were derived. In addition to the signal processing performed to obtain twelve MFCCs, (this was
Definition of context-independent models
In all experiments, 33 CI phone models were defined. In addition, two allophones of /l/ and /r/ were defined, for use in pre-vocalic or post-vocalic position. Finally, one model describing all sorts of noise as well as a model for silence were used. The 37 phone models and the noise model consisted of six HMM states; states 2, 4 and 6 shared the emission probability density function with states 1, 3 and 5, respectively. For the silence model a single-state HMM was used. All HMMs were
CI-HMMs for DB1
We trained and tested CI-HMMs for four different conditions: NCN, clR, CMS and pcR. The WER results are shown in Fig. 4 as a function of the total number of Gaussians used. Note that the scale shown at the top indicates the number of Gaussians per state that was used for each HMM configuration. Fig. 4 shows that clR deteriorates recognition performance compared to NCN. Apparently, removing the channel bias with clR introduces amplitude drift to such a degree that the gain from CN is completely
Discussion
In this paper, we investigated recognition configurations differing in the average number of left contexts that were separately modelled. The degree of control in varying the average number of left contexts modelled per phoneme segment R was limited in these experiments. As can be seen in Eq. (6), parameter R depends on the total number of left contexts Cl, the total number of left phone segments being separately modelled Mli, and the total number of clustered left phone segments Mlc. Parameter
Conclusions
In this paper, the efficiency of clR filtering for CN was investigated for CSR based on CI and CD-HMMs. For two different CSR tasks, recognition performance was established for clR filtering, and compared to using no CN, cepstrum mean subtraction and pcR. With pcR, the channel bias is as effectively removed as with clR, while the amplitude drift towards zero introduced by clR is less important (de Veth and Boves, 1998). The study was focussed on whether the differences between clR filtering and
Acknowledgements
This research was funded through the Priority Programme Language and Speech Technology (TST). The TST Programme is sponsored by NWO (Dutch Organization for Scientific Research). The authors would like to thank Mirjam Wester (A2RT) for creating the lexicon used for the Polyphone database and Carsten Meyer (Philips Research, Aachen) for helpful discussions about the CD models.
References
- Aikawa, K., Singer, H., Kawahara, H., Tohkura, Y., 1993. A dynamic cepstrum incorporating time-frequency masking and...
- Atal, B.S., 1974. Automatic recognition of speakers from their voices. Proc. IEEE.
- de Veth, J., Boves, L., 1996. Comparison of channel normalisation techniques for automatic speech recognition over the...
- de Veth, J., Boves, L., 1997a. Channel normalisation using phase-corrected RASTA. In: Proc. ESCA-NATO Workshop on...
- de Veth, J., Boves, L., 1997b. Phase-corrected RASTA for automatic speech recognition over the phone. In: Proc....
- de Veth, J., Boves, L., 1998. Channel normalization techniques for automatic speech recognition over the telephone. Speech Communication.
- de Veth, J., Cranen, B., Boves, L., 2001. Acoustic features and distance measure to reduce the vulnerability of ASR...
- den Os, E., Boogaart, T., Boves, L., Klabbers, E., 1995. The Dutch Polyphone corpus. Proc. Eurospeech, pp....
- et al., 1994. Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Amer.
- Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process.
- Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.
☆ Expanded version of the paper presented at The 5th International Conference on Spoken Language Processing 1998, Sydney, Australia. Short-listed by the Scientific Committee of ICSLP-98 for publication as regular paper in Speech Communication.