Speech Communication, Volume 40, Issue 4, June 2003, Pages 449-466

Data-driven spectral basis functions for automatic speech recognition

https://doi.org/10.1016/S0167-6393(02)00127-9

Abstract

Feature extraction plays a major role in any form of pattern recognition. Current feature extraction methods used for automatic speech recognition (ASR) and speaker verification rely mainly on properties of speech production (modeled by all-pole filters) and perception (critical-band integration simulated by a Mel/Bark filter bank). We propose to use stochastic methods for designing feature extraction methods which are trained to alleviate the unwanted variability present in speech signals. In this paper we show that such data-driven methods provide significant advantages over the conventional methods, both in terms of ASR performance and in providing insight into the nature of the speech signal. The first part of the paper investigates the suitability of the cepstral features obtained by applying the discrete cosine transform to logarithmic critical-band power spectra. An alternate set of basis functions was designed by linear discriminant analysis (LDA) of logarithmic critical-band power spectra. Discriminant features extracted by these alternate basis functions are shown to outperform the cepstral features in ASR experiments. The second part of the paper discusses the relevance of the non-uniform frequency resolution used by current speech analysis methods such as Mel frequency analysis and perceptual linear predictive analysis. It is shown that LDA of the short-time Fourier spectrum of speech yields spectral basis functions which provide comparatively lower resolution in the high-frequency region of the spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds.

Introduction

A typical large vocabulary automatic speech recognition (ASR) system consists of three main components: feature extraction, pattern classification, and language modeling. Feature extraction attempts to reduce the data rate of raw speech data by alleviating irrelevant variability such as speaker characteristics or environmental noise. The pattern classification module uses statistical models, mainly Hidden Markov models (HMMs), to determine the most likely sequence of sounds that could have generated the observed sequence of feature vectors. The language model steers the classification module to generate word sequences that are more probable.

While many of the early ASR systems used stochastic methods for pattern classification (Vintsyuk, 1968; Jelinek, 1975; Sakoe and Chiba, 1978), some early ASR systems were inspired by advances in artificial intelligence (Lesser et al., 1975). These systems relied on sets of rules for pattern classification and language modeling. These rules were often prescribed by experts in reading spectrograms and hand-crafted for the specific recognition problem at hand (Reddy, 1976; Zue, 1990; Mercier et al., 1990). They worked reasonably well for small tasks under controlled environments. However, the performance of such systems was found to be fragile (Klatt, 1977).

The past two decades have witnessed a significant increase in the use of stochastic approaches in both the pattern classification and the language modeling modules. These stochastic approaches brought the rich mathematical basis available in the classical pattern recognition literature to ASR. In current ASR systems, the pattern classification module uses HMMs (Jelinek, 1997; Rabiner and Juang, 1993) and artificial neural networks (Morgan and Bourlard, 1995), while the language models are in the form of N-grams trained from a large text corpus (Katz, 1987). Stochastic techniques typically use only minimal a priori assumptions about the nature of the problem and estimate the parameters of their models directly from the data. Replacing hardwired prior knowledge by knowledge derived from the data turned out to be one of the more significant advances in ASR research.

Although statistical methods are dominant in pattern classification and language modeling, the methods used to derive features for speech recognition are still based on knowledge of the human auditory system. Mainstream systems simulate auditory filters by applying a weighting function to the short-time Fourier spectrum (Mermelstein, 1976; Hermansky, 1990). The output of these weighting functions is then passed through a logarithmic non-linearity and finally projected onto cosine basis functions (the discrete cosine transform, DCT).
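The conventional pipeline just described can be sketched as follows. This is a minimal illustration, not the exact front end of any system discussed in the paper; the filter count, FFT size, and cepstral order are illustrative choices, and the Mel formula used is one common variant.

```python
import numpy as np

def mel(f):
    # One common Mel warping formula (illustrative choice)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular weighting functions spaced uniformly on the Mel scale,
    # applied to the short-time power spectrum
    mel_edges = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def cepstrum(frame, fb, n_ceps=13):
    # Power spectrum -> filter-bank energies -> log non-linearity -> DCT
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    log_e = np.log(fb @ spec + 1e-10)
    n = len(log_e)
    # Explicit cosine basis functions (DCT-II) -- the projection whose
    # optimality the paper questions
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    np.arange(n) + 0.5) / n)
    return basis @ log_e
```

Writing the DCT as an explicit basis matrix makes the point of the paper concrete: the cosine rows are a fixed, data-independent basis, and the question is whether a data-driven basis in their place preserves more useful information.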

In this paper we first investigate the optimality of using the DCT to extract features from the outputs of a non-uniform filter bank. In Section 2 it will be shown that the DCT can at best be equivalent to a principal component transform, which preserves the maximum amount of variance while reducing the dimensionality of the feature space. In Section 3 we review the theory behind linear discriminant analysis (LDA) and show in Section 4 that basis functions derived using LDA (designed to preserve maximum phonetic class separability) outperform the DCT (Hermansky and Malayath, 1998). The need for critical-band analysis itself (which implies non-uniform resolution) is investigated in Section 5. Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of hearing. This analysis provides higher resolution in the low-frequency region of the speech spectrum. We investigate the optimality of such an analysis from the point of view of pattern recognition, and show that Mel/Bark-like frequency resolution automatically results from discriminant analysis of the short-time Fourier spectrum of speech. We go on to show that this non-uniform resolution can be traced to the physiology of the speech production mechanism. Earlier work has related the Mel/Bark scale to the properties of the speech signal and the speech production mechanism. Mel-like warping has been shown to be optimal for normalizing speaker variability (Umesh et al., 1997; Kamm et al., 1997). To our knowledge, the Mel/Bark scale was first shown to result from LDA by Hermansky and Malayath (1998). Hunt argued that the advantage of non-uniform spacing of channels is due to the acoustical properties of voiced sounds (Hunt, 1999). A detailed analysis of the relation between speech production and the optimality of non-uniform frequency resolution can be found in (Malayath, 2000).

Section snippets

Feature extraction techniques for ASR

The feature extraction module in ASR typically consists of a series of processing steps as shown in Fig. 1. Some of these steps are inherited from speech coding, and others are justified by perceptual or pattern matching arguments. A widely used speech representation is the auditory-like cepstrum (Mermelstein, 1976; Hermansky, 1990). This cepstrum represents an appropriately modified (through auditory-like frequency and amplitude warping and critical-band smoothing) short-time spectrum of speech.

Linear discriminant analysis

In speech recognition, the features extracted from the signal are used to classify the sounds into phonetic categories. Hence, a feature extraction technique should be designed to preserve as much class separability as possible. An ideal feature extractor should be able to reduce the error to its theoretical limit, which is given by the Bayes error (Fukunaga, 1990). For an L-class problem, the Bayes classifier that yields minimum error compares the L a posteriori probabilities P(ω1|x), P(ω2|x), …, P(ωL|x)
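The LDA basis used throughout the paper maximizes the Fisher criterion: the ratio of between-class to within-class scatter (Fukunaga, 1990). A minimal sketch, solving the resulting generalized eigenproblem via Sw⁻¹Sb (toy data; the paper applies this to logarithmic critical-band and Fourier spectra labeled by phonetic class):

```python
import numpy as np

def lda_basis(X, y, n_components):
    """Discriminant basis functions: directions maximizing between-class
    scatter relative to within-class scatter (Fisher criterion)."""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Generalized eigenproblem Sb v = lambda Sw v, solved as eig(Sw^-1 Sb);
    # keep the leading eigenvectors (at most L-1 are informative)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real
```

Projecting features onto the returned columns performs the rotation-plus-dimensionality-reduction that the paper compares against the DCT; unlike the cosine basis, these directions depend on the labeled training data.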

Discriminant analysis of the critical-band spectrum

From Section 2 it is clear that the DCT approximately decorrelates the critical-band spectral features. In this section, the optimality of such a rotation in preserving phonetic discriminability is questioned. In this context it is assumed that phonemes are the basic units for speech recognition. Hence the rotation and dimensionality reduction should be able to preserve the variance introduced by phonemes while suppressing the variance introduced by sources like coarticulation, channel and

Discriminant analysis of short-time Fourier spectrum

In the previous sections the optimality of the DCT applied to the critical-band spectrum was analyzed. It was shown that basis functions derived using LDA perform better than the DCT when both are applied to the critical-band power spectrum. In this section the utility of critical-band analysis itself is investigated by performing discriminant analysis directly on the short-time logarithmic Fourier spectrum.
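The input to this analysis, the logarithmic short-time Fourier spectrum, can be sketched as below (frame length and hop are illustrative choices). The point is that the frequency axis here is uniform: no Mel warping or critical-band smoothing precedes the discriminant analysis, so any non-uniform resolution in the resulting basis functions must come from the data itself.

```python
import numpy as np

def log_stft(signal, frame_len=256, hop=128):
    # Logarithmic short-time Fourier power spectrum: one row per frame,
    # one column per uniformly spaced frequency bin. Frames labeled by
    # phonetic class would be the input vectors for LDA.
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)
```

Applying LDA to such uniformly spaced log-spectral vectors yields the spectral basis functions whose frequency resolution is examined in this section.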

Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of

Summary and conclusions

In this paper we showed that stochastic approaches can be used to design feature extraction methods that provide substantial advantages over conventional feature extraction methods. The advantages of this data-driven approach are the following: (a) Since feature extraction involves a reduction in dimensionality, using data-driven methods for feature extraction ensures that the preserved dimensions carry the maximum amount of useful information. This could improve the

Acknowledgements

The authors would like to thank the reviewers for their comments. This work was supported by DoD (MDA904-98-1-0521 and MDA904-99-1-0044), NSF (IRI-9712579) and by industrial grants from Qualcomm, Intel and Texas Instruments to the Anthropic Signal Processing Group at OGI.

References (53)

  • R. Cole et al. New telephone speech corpora at CSLU.
  • G.R. Doddington. Phonetically sensitive discriminants for improved speech recognition.
  • T. Eisele et al. A comparative study of linear feature transformation techniques for automatic speech recognition.
  • J. Flanagan. Difference limen for vowel formant frequency. J. Acoust. Soc. Am. (1955).
  • K. Fukunaga. Statistical Pattern Recognition (1990).
  • S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. (1981).
  • H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. (1990).
  • H. Hermansky et al. Spectral basis functions from discriminant analysis.
  • M. Hunt. A statistical approach to metrics for word and syllable recognition. J. Acoust. Soc. Am. (1979).
  • M. Hunt, 1999. Spectral signal processing for ASR. In: Automatic Speech Recognition and Understanding Workshop...
  • M. Hunt et al. Speaker dependent and independent speech recognition experiments with an auditory model.
  • M. Hunt et al. A comparison of several acoustic representations for speech recognition with degraded and undegraded speech.
  • M. Hunt et al. An investigation of PLP and IMELDA acoustic representations and of their potential for combination.
  • A.K. Jain. A sinusoidal family of unitary transforms. IEEE Trans. Pattern Anal. Machine Intell. (1979).
  • F. Jelinek. Design of a linguistic statistical decoder for recognition of continuous speech. IEEE Trans. Inform. Theory (1975).
  • F. Jelinek. Statistical Methods for Speech Recognition (1997).