Data-driven spectral basis functions for automatic speech recognition
Introduction
A typical large-vocabulary automatic speech recognition (ASR) system consists of three main components: feature extraction, pattern classification, and language modeling. Feature extraction attempts to reduce the data rate of the raw speech signal by suppressing irrelevant variability such as speaker characteristics or environmental noise. The pattern classification module uses statistical models, mainly hidden Markov models (HMMs), to determine the most likely sequence of sounds that could have generated the observed sequence of feature vectors. The language model steers the classification module toward word sequences that are more probable.
While many of the early ASR systems used stochastic methods for pattern classification (Vintsyuk, 1968; Jelinek, 1975; Sakoe and Chiba, 1978), others were inspired by advances in artificial intelligence (Lesser et al., 1975). These systems relied on sets of rules for pattern classification and language modeling, often prescribed by experts skilled in reading spectrograms and hand-crafted for the specific recognition problem at hand (Reddy, 1976; Zue, 1990; Mercier et al., 1990). They worked reasonably well for small tasks in controlled environments, but their performance proved fragile (Klatt, 1977).
The past two decades have witnessed a significant increase in stochastic approaches in both the pattern classification and the language modeling modules. These approaches brought the rich mathematical foundation of the classical pattern recognition literature to ASR. In current ASR systems, the pattern classification module uses HMMs (Jelinek, 1997; Rabiner and Juang, 1993) and artificial neural networks (Morgan and Bourlard, 1995), while the language models take the form of N-grams trained on a large text corpus (Katz, 1987). Stochastic techniques typically make only minimal a priori assumptions about the nature of the problem and estimate the parameters of their models directly from the data. Replacing hardwired prior knowledge with knowledge derived from the data turned out to be one of the more significant advances in ASR research.
Although statistical methods are dominant in pattern classification and language modeling, the methods used to derive features for speech recognition are still based on knowledge of the human auditory system. Mainstream systems simulate auditory filters by applying a weighting function to the short-time Fourier spectrum (Mermelstein, 1976; Hermansky, 1990). The output of these weighting functions is then passed through a logarithmic non-linearity and finally projected onto cosine basis functions (the discrete cosine transform, DCT).
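As a concrete illustration, this filter-bank/log/DCT pipeline can be sketched in a few lines of numpy. The 23-filter bank, the mel warping formula, and the 13 retained cepstral coefficients below are illustrative choices, not the exact configuration of any system cited here.

```python
import numpy as np

def hz_to_mel(f):
    # Mel warping (a common analytic approximation)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def dct_basis(n_out, n_in):
    # Rows are DCT-II basis functions (unnormalized)
    n = np.arange(n_in)
    return np.cos(np.pi * np.outer(np.arange(n_out), (2 * n + 1) / (2.0 * n_in)))

def mfcc_frame(frame, fs, n_filters=23, n_ceps=13):
    # Short-time power spectrum -> filter bank -> log -> DCT
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    fbank = mel_filterbank(n_filters, n_fft, fs) @ spec
    return dct_basis(n_ceps, n_filters) @ np.log(fbank + 1e-10)
```

The DCT at the final step is precisely the projection whose optimality the remainder of the paper examines.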
In this paper we first investigate the optimality of using the DCT to extract features from the outputs of a non-uniform filter bank. In Section 2 it will be shown that the DCT can at best be equivalent to a principal component transform, which preserves the maximum amount of variance while reducing the dimensionality of the feature space. In Section 3 we review the theory behind linear discriminant analysis (LDA) and show in Section 4 that basis functions derived using LDA (designed to preserve maximum phonetic class separability) outperform the DCT (Hermansky and Malayath, 1998). The need for critical-band analysis itself (which implies non-uniform resolution) is investigated in Section 5. Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of hearing; this analysis provides higher resolution in the low-frequency region of the speech spectrum. We investigate the optimality of such an analysis from the point of view of pattern recognition and show that Mel/Bark-like frequency resolution automatically results from discriminant analysis of the short-time Fourier spectrum of speech. We go on to show that this non-uniform resolution can be traced to the physiology of the speech production mechanism. Earlier work has related the Mel/Bark scale to properties of the speech signal and the speech production mechanism: Mel-like warping has been shown to be optimal for normalizing speaker variability (Umesh et al., 1997; Kamm et al., 1997). To our knowledge, the Mel/Bark scale was first shown to result from LDA by Hermansky and Malayath (1998). Hunt argued that the advantage of non-uniform spacing of channels is due to the acoustical properties of voiced sounds (Hunt, 1999). A detailed analysis of the relation between speech production and the optimality of non-uniform frequency resolution can be found in Malayath (2000).
Feature extraction techniques for ASR
The feature extraction module in ASR typically consists of a series of processing steps as shown in Fig. 1. Some of these steps are inherited from speech coding, and others are justified by perceptual or pattern-matching arguments. A widely used speech representation is the auditory-like cepstrum (Mermelstein, 1976; Hermansky, 1990). This cepstrum represents an appropriately modified (through auditory-like frequency and amplitude warping and critical-band smoothing) short-time spectrum of speech,
Linear discriminant analysis
In speech recognition, the features extracted from the signal are used to classify sounds into phonetic categories. Hence, a feature extraction technique should be designed to preserve as much class separability as possible. An ideal feature extractor should be able to reduce the error to its theoretical limit, given by the Bayes error (Fukunaga, 1990). For an L-class problem, the Bayes classifier that yields minimum error compares L a posteriori probabilities,
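The standard two-scatter formulation of LDA (within-class scatter Sw, between-class scatter Sb, as in Fukunaga, 1990) can be sketched in numpy as follows. The whitening route to the generalized eigenproblem used here is one common implementation choice, not necessarily the one used in the experiments of this paper.

```python
import numpy as np

def lda_basis(X, y, n_components):
    """Columns of the returned matrix are discriminant basis vectors:
    directions maximizing between-class over within-class variance."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Solve the generalized eigenproblem Sb v = lambda Sw v by whitening:
    # form Sw^(-1/2), then diagonalize the whitened between-class scatter.
    evals, evecs = np.linalg.eigh(Sw)
    W = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-10))) @ evecs.T
    evals_b, evecs_b = np.linalg.eigh(W @ Sb @ W)
    order = np.argsort(evals_b)[::-1]  # descending discriminative power
    return W @ evecs_b[:, order[:n_components]]
```

Applied to labeled feature vectors, the leading columns play the same role as the DCT rows in conventional cepstral analysis, but are driven by class separability rather than fixed a priori.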
Discriminant analysis of the critical-band spectrum
From Section 2 it is clear that the DCT approximately decorrelates the critical-band spectral features. In this section, the optimality of such a rotation in preserving phonetic discriminability is questioned. In this context it is assumed that phonemes are the basic units for speech recognition. Hence the rotation and dimensionality reduction should be able to preserve the variance introduced by phonemes while suppressing the variance introduced by sources like coarticulation, channel and
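The near-equivalence of the DCT and the principal component (Karhunen-Loève) transform for highly correlated data can be checked numerically. The sketch below assumes, purely for illustration, that the critical-band log spectrum behaves like a first-order Markov process with correlation coefficient rho; it compares the variance captured by the leading DCT coefficients against the optimum attained by PCA.

```python
import numpy as np

# Covariance of a first-order Markov (AR(1)) process: C[i, j] = rho**|i - j|
n, rho = 15, 0.9  # illustrative dimensionality and correlation
idx = np.arange(n)
C = rho ** np.abs(idx[:, None] - idx[None, :])

# PCA/KLT: eigenvalues of C, sorted in decreasing order
evals = np.linalg.eigh(C)[0][::-1]

# Orthonormal DCT-II matrix (rows are basis functions)
D = np.cos(np.pi * np.outer(idx, 2 * idx + 1) / (2.0 * n))
D[0] *= 1.0 / np.sqrt(2.0)
D *= np.sqrt(2.0 / n)

# Variance captured by the first m coefficients of each transform;
# PCA is optimal, so the ratio is at most 1 and close to 1 for the DCT.
m = 4
pca_energy = evals[:m].sum()
dct_energy = np.diag(D @ C @ D.T)[:m].sum()
ratio = dct_energy / pca_energy
```

The ratio approaching 1 is exactly the sense in which the DCT "approximately decorrelates" such features; the question taken up in this section is whether variance preservation is the right criterion at all.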
Discriminant analysis of short-time Fourier spectrum
In the previous sections the optimality of the DCT applied to the critical-band spectrum was analyzed. It was shown that basis functions derived using LDA perform better than the DCT when both are applied to the critical-band power spectrum. In this section the utility of critical-band analysis itself is investigated by performing discriminant analysis directly on the short-time logarithmic Fourier spectrum.
Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of
Summary and conclusions
In this paper we showed that stochastic approaches can be used to design feature extraction methods that provide substantial advantages over conventional feature extraction methods. The advantages of this data-driven approach are the following: (a) Since feature extraction involves a reduction in dimensionality, using data-driven methods ensures that the dimensions that are preserved carry the maximum amount of useful information. This could improve the
Acknowledgements
The authors would like to thank the reviewers for their comments. This work was supported by DoD (MDA904-98-1-0521 and MDA904-99-1-0044), NSF (IRI-9712579) and by industrial grants from Qualcomm, Intel and Texas Instruments to the Anthropic Signal Processing Group at OGI.
References (53)
- Should recognizers have ears? Speech Comm. (1998).
- Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Comm. (1998).
- Recognition of speaker-dependent continuous speech with KEAL.
- Digital representations of speech signals.
- The use of speech knowledge in automatic speech recognition.
- Data based filter design for RASTA-like channel normalization in ASR.
- A discriminatively derived linear transformation for improved speech recognition.
- Feature decorrelation methods in speech recognition: a comparative study.
- Brown, P., 1987. The acoustic-modeling problem in automatic speech recognition. Ph.D. Thesis, Carnegie Mellon...
- Cole, R., Noel, M., Lander, T., 1994. Telephone speech corpus development at CSLU. In: Proceedings of the International...
- New telephone speech corpora at CSLU.
- Phonetically sensitive discriminants for improved speech recognition.
- A comparative study of linear feature transformation techniques for automatic speech recognition.
- Difference limen for vowel formant frequency. J. Acoust. Soc. Am.
- Statistical Pattern Recognition.
- Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process.
- Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am.
- Spectral basis functions from discriminant analysis.
- A statistical approach to metrics for word and syllable recognition. J. Acoust. Soc. Am.
- Speaker dependent and independent speech recognition experiments with an auditory model.
- A comparison of several acoustic representations for speech recognition with degraded and undegraded speech.
- An investigation of PLP and IMELDA acoustic representations and of their potential for combination.
- A sinusoidal family of unitary transforms. IEEE Trans. Pattern Anal. Machine Intell.
- Design of a linguistic statistical decoder for recognition of continuous speech. IEEE Trans. Inform. Theory.
- Statistical Methods for Speech Recognition.