Speech Communication, Volume 40, Issue 4, June 2003, Pages 449-466

Data-driven spectral basis functions for automatic speech recognition

https://doi.org/10.1016/S0167-6393(02)00127-9

Abstract

Feature extraction plays a major role in any form of pattern recognition. Current feature extraction methods used for automatic speech recognition (ASR) and speaker verification rely mainly on properties of speech production (modeled by all-pole filters) and perception (critical-band integration simulated by a Mel/Bark filter bank). We propose to use stochastic methods for designing feature extraction methods which are trained to alleviate the unwanted variability present in speech signals. In this paper we show that such data-driven methods provide significant advantages over the conventional methods, both in terms of ASR performance and in providing insight into the nature of the speech signal. The first part of the paper investigates the suitability of the cepstral features obtained by applying the discrete cosine transform to logarithmic critical-band power spectra. An alternate set of basis functions was designed by linear discriminant analysis (LDA) of logarithmic critical-band power spectra. Discriminant features extracted by these alternate basis functions are shown to outperform the cepstral features in ASR experiments. The second part of the paper discusses the relevance of the non-uniform frequency resolution used by current speech analysis methods such as Mel frequency analysis and perceptual linear predictive analysis. It is shown that LDA of the short-time Fourier spectrum of speech yields spectral basis functions which provide comparatively lower resolution in the high-frequency region of the spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds.

Introduction

A typical large vocabulary automatic speech recognition (ASR) system consists of three main components: feature extraction, pattern classification, and language modeling. Feature extraction attempts to reduce the data rate of raw speech data by alleviating irrelevant variability such as speaker characteristics or environmental noise. The pattern classification module uses statistical models, mainly Hidden Markov models (HMMs), to determine the most likely sequence of sounds that could have generated the observed sequence of feature vectors. The language model steers the classification module to generate word sequences that are more probable.

While many of the early ASR systems used stochastic methods for pattern classification (Vintsyuk, 1968; Jelinek, 1975; Sakoe and Chiba, 1978), some early ASR systems were inspired by advances in artificial intelligence (Lesser et al., 1975). These systems relied on sets of rules for pattern classification and language modeling. These rules were often prescribed by experts in reading spectrograms and hand-crafted for the specific recognition problem at hand (Reddy, 1976; Zue, 1990; Mercier et al., 1990). They worked reasonably well for small tasks under controlled environments. However, the performance of such systems was found to be fragile (Klatt, 1977).

The past two decades have witnessed a significant increase in the use of stochastic approaches in both the pattern classification and the language modeling modules. These stochastic approaches brought the rich mathematical basis available in the classical pattern recognition literature to ASR. In current ASR systems, the pattern classification module uses HMMs (Jelinek, 1997; Rabiner and Juang, 1993) and artificial neural networks (Morgan and Bourlard, 1995), while the language models are in the form of N-grams trained from a large text corpus (Katz, 1987). Stochastic techniques typically use only minimal a priori assumptions about the nature of the problem and estimate the parameters of their models directly from the data. Replacing hardwired prior knowledge by knowledge derived from the data turned out to be one of the more significant advances in ASR research.

Although statistical methods are dominant in pattern classification and language modeling, the methods used to derive features for speech recognition are still based on knowledge of the human auditory system. Mainstream systems simulate auditory filters by applying a weighting function to the short-time Fourier spectrum (Mermelstein, 1976; Hermansky, 1990). The output of these weighting functions is then passed through a logarithmic non-linearity and finally projected onto cosine basis functions (the discrete cosine transform, DCT).
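The conventional pipeline just described can be sketched as follows. This is a minimal illustration, not the exact front end of any system discussed in the paper; the filter count, FFT size, and cepstral order are illustrative choices, and the Mel formula used is one common variant.

```python
import numpy as np

def mel(f):
    # One common Mel warping formula (illustrative choice)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular weighting functions spaced uniformly on the Mel scale,
    # applied to the short-time power spectrum
    mel_edges = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_edges / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def cepstrum(frame, fb, n_ceps=13):
    # Power spectrum -> filter-bank energies -> log non-linearity -> DCT
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    log_e = np.log(fb @ spec + 1e-10)
    n = len(log_e)
    # Explicit cosine basis functions (DCT-II) -- the projection whose
    # optimality the paper questions
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    np.arange(n) + 0.5) / n)
    return basis @ log_e
```

Writing the DCT as an explicit basis matrix makes the point of the paper concrete: the cosine rows are a fixed, data-independent basis, and the question is whether a data-driven basis in their place preserves more useful information.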

In this paper we first investigate the optimality of using the DCT to extract features from the outputs of a non-uniform filter bank. In Section 2 it will be shown that the DCT can at best be equivalent to a principal component transform, which preserves the maximum amount of variance while reducing the dimensionality of the feature space. In Section 3 we review the theory behind linear discriminant analysis (LDA) and show in Section 4 that basis functions derived using LDA (designed to preserve maximum phonetic class separability) outperform the DCT (Hermansky and Malayath, 1998). The need for critical-band analysis itself (which implies non-uniform resolution) is investigated in Section 5. Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of hearing. This analysis provides higher resolution in the low-frequency region of the speech spectrum. We investigate the optimality of such an analysis from the point of view of pattern recognition, and show that Mel/Bark-like frequency resolution automatically results from discriminant analysis of the short-time Fourier spectrum of speech. We go on to show that this non-uniform resolution can be traced to the physiology of the speech production mechanism. Earlier work has related the Mel/Bark scale to the properties of the speech signal and the speech production mechanism. Mel-like warping has been shown to be optimal for normalizing speaker variability (Umesh et al., 1997; Kamm et al., 1997). To our knowledge, the Mel/Bark scale was first shown to result from LDA by Hermansky and Malayath (1998). Hunt argued that the advantage of non-uniform spacing of channels is due to the acoustical properties of voiced sounds (Hunt, 1999). A detailed analysis of the relation between speech production and the optimality of non-uniform frequency resolution can be found in (Malayath, 2000).

Section snippets

Feature extraction techniques for ASR

The feature extraction module in ASR typically consists of a series of processing steps as shown in Fig. 1. Some of these steps are inherited from speech coding, and others are justified by perceptual or pattern matching arguments. A widely used speech representation is the auditory-like cepstrum (Mermelstein, 1976; Hermansky, 1990). This cepstrum represents an appropriately modified (through auditory-like frequency and amplitude warping and critical-band smoothing) short-time spectrum of speech.

Linear discriminant analysis

In speech recognition, the features extracted from the signal are used to classify the sounds into phonetic categories. Hence, a feature extraction technique should be designed to preserve as much class separability as possible. An ideal feature extractor should be able to reduce the error to its theoretical limit, which is given by the Bayes error (Fukunaga, 1990). For an L-class problem, the Bayes classifier that yields minimum error compares the L a posteriori probabilities P(ω1|x), P(ω2|x), …, P(ωL|x)
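The LDA basis used throughout the paper maximizes the Fisher criterion: the ratio of between-class to within-class scatter (Fukunaga, 1990). A minimal sketch, solving the resulting generalized eigenproblem via Sw⁻¹Sb (toy data; the paper applies this to logarithmic critical-band and Fourier spectra labeled by phonetic class):

```python
import numpy as np

def lda_basis(X, y, n_components):
    """Discriminant basis functions: directions maximizing between-class
    scatter relative to within-class scatter (Fisher criterion)."""
    classes = np.unique(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Generalized eigenproblem Sb v = lambda Sw v, solved as eig(Sw^-1 Sb);
    # keep the leading eigenvectors (at most L-1 are informative)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real
```

Projecting features onto the returned columns performs the rotation-plus-dimensionality-reduction that the paper compares against the DCT; unlike the cosine basis, these directions depend on the labeled training data.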

Discriminant analysis of the critical-band spectrum

From Section 2 it is clear that the DCT approximately decorrelates the critical-band spectral features. In this section, the optimality of such a rotation in preserving phonetic discriminability is questioned. In this context it is assumed that phonemes are the basic units for speech recognition. Hence the rotation and dimensionality reduction should be able to preserve the variance introduced by phonemes while suppressing the variance introduced by sources like coarticulation, channel and

Discriminant analysis of short-time Fourier spectrum

In the previous sections the optimality of the DCT applied to the critical-band spectrum was analyzed. It was shown that basis functions derived using LDA perform better than the DCT when both are applied to the critical-band power spectrum. In this section the utility of critical-band analysis itself is investigated by performing discriminant analysis directly on the short-time logarithmic Fourier spectrum.
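The input to this analysis, the logarithmic short-time Fourier spectrum, can be sketched as below (frame length and hop are illustrative choices). The point is that the frequency axis here is uniform: no Mel warping or critical-band smoothing precedes the discriminant analysis, so any non-uniform resolution in the resulting basis functions must come from the data itself.

```python
import numpy as np

def log_stft(signal, frame_len=256, hop=128):
    # Logarithmic short-time Fourier power spectrum: one row per frame,
    # one column per uniformly spaced frequency bin. Frames labeled by
    # phonetic class would be the input vectors for LDA.
    win = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)
```

Applying LDA to such uniformly spaced log-spectral vectors yields the spectral basis functions whose frequency resolution is examined in this section.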

Use of critical-band analysis (or Mel filter-bank analysis) is motivated by properties of

Summary and conclusions

In this paper we showed that stochastic approaches can be used to design feature extraction methods that provide substantial advantages over conventional feature extraction methods. The advantages of this data-driven approach are the following: (a) Since feature extraction involves a reduction in dimensionality, using data-driven methods for feature extraction ensures that the preserved dimensions carry the maximum amount of useful information. This could improve the

Acknowledgements

The authors would like to thank the reviewers for their comments. This work was supported by DoD (MDA904-98-1-0521 and MDA904-99-1-0044), NSF (IRI-9712579) and by industrial grants from Qualcomm, Intel and Texas Instruments to the Anthropic Signal Processing Group at OGI.

References (53)

  • R. Cole et al. New telephone speech corpora at CSLU.
  • G.R. Doddington. Phonetically sensitive discriminants for improved speech recognition.
  • T. Eisele et al. A comparative study of linear feature transformation techniques for automatic speech recognition.
  • J. Flanagan. Difference limen for vowel formant frequency. J. Acoust. Soc. Am. (1955).
  • K. Fukunaga. Statistical Pattern Recognition (1990).
  • S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. (1981).
  • H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. (1990).
  • H. Hermansky et al. Spectral basis functions from discriminant analysis.
  • M. Hunt. A statistical approach to metrics for word and syllable recognition. J. Acoust. Soc. Am. (1979).
  • M. Hunt, 1999. Spectral signal processing for ASR. In: Automatic Speech Recognition and Understanding Workshop...
  • M. Hunt et al. Speaker dependent and independent speech recognition experiments with an auditory model.
  • M. Hunt et al. A comparison of several acoustic representations for speech recognition with degraded and undegraded speech.
  • M. Hunt et al. An investigation of PLP and IMELDA acoustic representations and of their potential for combination.
  • A.K. Jain. A sinusoidal family of unitary transforms. IEEE Trans. Pattern Anal. Machine Intell. (1979).
  • F. Jelinek. Design of a linguistic statistical decoder for recognition of continuous speech. IEEE Trans. Inform. Theory (1975).
  • F. Jelinek. Statistical Methods for Speech Recognition (1997).