Techniques for handling convolutional distortion with 'missing data' automatic speech recognition

https://doi.org/10.1016/j.specom.2004.02.005

Abstract

In this study we describe two techniques for handling convolutional distortion with 'missing data' speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach for spectrally distorted and noise-contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify 'reliable' regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram.

Introduction

Although much research effort has been expended on the development of automatic speech recognition (ASR) systems, their performance still remains far from that of human listeners. In particular, human speech perception is robust when speech is corrupted by noise or by other environmental interference, such as reverberation or a poor transmission line (for example, see Assmann and Summerfield, 2003; Nabelek and Robinson, 1982). In contrast, ASR performance falls dramatically in such conditions (for a comparative review of human and automatic speech recognition performance in noise see Lippmann, 1997). As several researchers have observed (e.g., Cooke et al., 2001; Hermansky, 1998; Lippmann, 1997), the current limitations of ASR systems might reflect our limited understanding of human speech perception, and especially our inadequate technological replication of the underlying processes.

The robustness of human speech perception can be attributed to two main factors. First, listeners are able to segregate complex acoustic mixtures in order to extract a description of a target sound source (such as the voice of a speaker). Bregman (1990) describes this process as 'auditory scene analysis'. Second, human speech perception is robust even when speech is partly masked by noise, or when parts of the acoustic spectrum are removed altogether (for example, by a bandlimited communications channel). Cooke et al. (2001) have interpreted this ability in terms of a 'missing data' model of speech recognition, and have adapted a hidden Markov model (HMM) classifier to deal with missing or unreliable features. In their system, a time-frequency 'mask' is employed to indicate whether acoustic features are reliable or corrupted; according to this division, the features are treated differently by the recogniser. Typically, the missing data mask is derived from auditory-motivated processing, such as pitch analysis (Barker et al., 2001a; Brown et al., 2001) or binaural spatial processing (Palomäki et al., 2001; Palomäki et al., in press). Alternatively, the mask can be set according to local estimates of the signal-to-noise ratio (SNR) (Cooke et al., 2001).
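To make the mask-based likelihood computation concrete, the following minimal sketch (illustrative only, not the authors' implementation) scores one feature frame against a diagonal-covariance Gaussian state model by evaluating the likelihood over the reliable feature dimensions only; Cooke et al. (2001) additionally exploit the observed values as upper bounds on the masked components ('bounded marginalisation'), which is omitted here.

```python
import numpy as np

def marginal_log_likelihood(x, mask, mean, var):
    """Log-likelihood of frame x under a diagonal Gaussian, using only
    the components marked reliable in the mask.

    x, mean, var : (D,) arrays; mask : (D,) boolean, True = reliable.
    Minimal sketch of missing-data marginalisation; bounded
    marginalisation (Cooke et al., 2001) is omitted.
    """
    r = mask.astype(bool)
    diff = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var[r]) + diff ** 2 / var[r])

# Toy usage: a 4-channel frame in which channels 2 and 4 are masked.
x    = np.array([2.0, 9.0, 1.5, 7.0])    # observed spectral features
mask = np.array([True, False, True, False])
mean = np.array([2.1, 1.0, 1.4, 0.9])    # state mean (clean speech model)
var  = np.array([0.5, 0.5, 0.5, 0.5])
print(marginal_log_likelihood(x, mask, mean, var))
```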

The missing data paradigm was conceived by Cooke et al. as a means of dealing with additive noise in ASR. As a result, little consideration has been given to the ability of missing data ASR systems to handle interference caused by the interaction of a target sound with its environment (such as a transmission line, audio equipment or reverberant space). In terms of signal theory this is regarded as convolutional interference. In this paper, we propose a number of modifications to a missing data ASR system which allow it to perform robustly in the presence of convolutional noise.

Convolutional interference can be characterised by the impulse response of the corresponding system. If the impulse response is short compared to the analysis window, then the interference mainly causes spectral alteration (see Avendano, 1997, Chapter 5). This follows because convolution in the time domain is equivalent to multiplication in the frequency domain (see Oppenheim and Schafer (1989) for a description of the convolution theorem of the Fourier transform). The analysis window used in speech processing is usually longer than 10 ms, which roughly corresponds to the pitch period of an average adult male voice. Examples of practical systems having short impulse responses are transmission lines, microphones and loudspeakers.
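The multiplicative effect of a short impulse response is easy to verify numerically. The sketch below (an illustration of the convolution theorem only, not of any system component) shows that, with sufficient zero-padding, the DFT of the convolved signal equals the product of the DFTs of the signal and the impulse response.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(400)        # 50 ms of 'speech' at 8 kHz
h = rng.standard_normal(16) * 0.1   # 2 ms impulse response (e.g. a microphone)

y = np.convolve(x, h)               # time-domain convolution
n = len(x) + len(h) - 1             # zero-pad both to the full output length
Y = np.fft.rfft(y, n)
XH = np.fft.rfft(x, n) * np.fft.rfft(h, n)

# Convolution in time is multiplication in frequency:
print(np.allclose(Y, XH))           # True
```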

In the case of room reverberation the interaction is of a different nature, because the impulse response of a room is relatively long (from approximately 0.2 up to 5 s) compared to the window used for speech analysis. A typical room impulse response consists of sparse early reflections followed by dense late reverberation (higher-order reflections), which forms the exponentially decaying tail of the response. The sparse early reflections are highly correlated with the speech signal and often contribute usefully to speech intelligibility by increasing the loudness of the speech. However, early reflections can also cause some spectral deviation, due to comb filtering caused by successive reflections and the varying frequency characteristics of surface absorption. In contrast, the dense late reverberation is poorly correlated with the original speech signal and therefore behaves more like additive noise. Indeed, the ratio of early to late reverberation has successfully been used as a predictor of speech intelligibility in rooms (Bradley, 1986). It is common to define a critical delay time separating early from late reverberation, such that reflections arriving before the delay are beneficial to auditory perception whereas reflections arriving after it have a detrimental effect. The European norm ISO 3382 (1997) suggests critical delays of 50 ms for speech and 80 ms for music perception. Gölzer and Kleinschmidt (2003) investigated the role of early and late reflections in conjunction with ASR. They suggested that reflections have a beneficial effect on speech recognition accuracy up to a critical delay of 25–50 ms, assuming that late reverberation is strongly present in the room impulse response. Further details of the effect of room acoustics on speech intelligibility can be found in Bradley (1986) and Houtgast and Steeneken (1985).
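The early/late distinction can be quantified by an early-to-late energy ratio such as the clarity index C50. The sketch below (illustrative only) computes it for a toy impulse response; the 50 ms split point follows the critical delay for speech quoted above.

```python
import numpy as np

def clarity(rir, fs, split_ms=50.0):
    """Early-to-late energy ratio (clarity index, e.g. C50) in dB.

    rir      : room impulse response, assumed to start at the direct sound
    fs       : sampling rate in Hz
    split_ms : critical delay separating early reflections from late
               reverberation (50 ms for speech, per ISO 3382)
    """
    k = int(round(split_ms * 1e-3 * fs))
    early = np.sum(rir[:k] ** 2)
    late = np.sum(rir[k:] ** 2)
    return 10.0 * np.log10(early / late)

# Toy RIR: direct sound plus an exponentially decaying noise-like tail.
fs = 8000
t = np.arange(int(0.7 * fs)) / fs
rir = np.exp(-t / 0.1) * np.random.default_rng(1).standard_normal(t.size)
rir[0] = 1.0
print(f"C50 = {clarity(rir, fs):.1f} dB")
```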

The conventional way of tackling convolutional interference in ASR has been to use cepstral encoding, and to employ cepstral mean subtraction to remove the spectral distortion. Two common examples of cepstral encoding are mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) and cepstral features obtained by perceptual linear prediction (PLP) (Hermansky, 1990). Interestingly, both of these approaches are loosely based on known mechanisms of auditory frequency encoding. However, they have been found to perform inadequately with reverberated speech (Kingsbury, 1998; Kingsbury et al., 1998). Reverberation can also be handled via blind source separation (BSS) using a microphone array, or via blind deconvolution or dereverberation (for an overview see Omologo et al., 1998). In such approaches, the aim is to enhance subjective speech quality rather than to find a robust acoustic encoding. BSS gives good dereverberation performance, but at least two microphone signals are needed to process a single speech source (for an overview of BSS and independent component analysis see Hyvärinen et al., 2001).
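As a reminder of why cepstral mean subtraction removes a stationary channel: convolution becomes multiplication in the spectral domain and hence an additive constant in the log-spectral (cepstral) domain, so subtracting the per-utterance mean cancels it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.

    cepstra : (T, D) array of cepstral frames (e.g. MFCC or PLP cepstra).
    A time-invariant channel h adds a constant log|H| to every frame in
    the log-spectral (hence cepstral) domain, so subtracting the
    utterance mean cancels it, along with the long-term speech average.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```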

Kingsbury and his colleagues (Kingsbury, 1998; Kingsbury et al., 1998) have reported that a modulation-filtered spectral representation, the modulation spectrogram (MSG), can improve ASR performance with reverberated speech. Spectral bands are processed by a modulation filter, which emphasises the strongest speech modulations and effectively removes reverberant or noisy regions that are not modulated in the same way as speech signals. This approach is consistent with studies that demonstrate the importance of low-frequency modulations in human speech recognition (Houtgast and Steeneken, 1985; Drullman et al., 1994) and in ASR (Kanedera et al., 1999).
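The sketch below illustrates the general idea of modulation filtering: band-pass filtering the temporal envelope of each spectral channel at syllable-rate modulation frequencies. The 2–16 Hz pass-band is an illustrative choice and does not reproduce Kingsbury's exact MSG filter design.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_filter(ratemap, frame_rate=100.0, band=(2.0, 16.0)):
    """Band-pass filter the temporal envelope of each spectral channel.

    ratemap    : (T, F) spectro-temporal energy (frames x channels)
    frame_rate : frames per second (100 Hz for a 10 ms hop)
    band       : pass-band in Hz; speech modulation energy peaks near
                 4 Hz, roughly the syllable rate. (Illustrative choice,
                 not Kingsbury's exact MSG filters.)
    """
    nyq = frame_rate / 2.0
    b, a = butter(2, [band[0] / nyq, band[1] / nyq], btype="band")
    return filtfilt(b, a, ratemap, axis=0)  # zero-phase, per channel
```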

In this study we address the problem of handling convolutional distortion in a missing data ASR system which uses spectral speech features. Two conditions are considered: one in which speech is subject to spectral distortion and additive noise, and another in which speech is reverberated. In the first case, we derive a missing data mask from estimates of the SNR in local time-frequency regions, and employ spectral subtraction to remove the noise background. Furthermore, we introduce a new method for normalising spectral features that is compatible with the missing data ASR framework. In reverberant conditions, a modulation filtering scheme is used to generate the missing data mask. This approach exploits the temporal modulations of speech in order to find spectro-temporal regions which are not severely contaminated by reverberation.
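As a rough illustration of the first pathway, the sketch below performs simple spectral subtraction with a stationary noise estimate taken from the leading frames, and marks as reliable those time-frequency points whose local SNR exceeds a threshold; the paper's actual noise estimator and mask criterion are described in Section 3.

```python
import numpy as np

def snr_mask(ratemap, n_noise_frames=10, snr_db=0.0, floor=0.01):
    """Spectral subtraction and an SNR-based missing data mask.

    ratemap        : (T, F) spectro-temporal energy
    n_noise_frames : leading frames assumed to contain noise only
    snr_db         : local SNR threshold above which a point is 'reliable'
    Minimal sketch with a stationary noise estimate, not the paper's
    exact noise estimator.
    """
    noise = ratemap[:n_noise_frames].mean(axis=0) + 1e-12   # (F,) estimate
    clean = np.maximum(ratemap - noise, floor * noise)      # subtract, floor
    local_snr = 10.0 * np.log10(clean / noise)
    mask = local_snr > snr_db                               # True = reliable
    return clean, mask
```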

The current study extends our previous work in several important respects. A related scheme for spectral normalisation was presented in (Palomäki et al., in press), but it was applied only to a very specific purpose (speech recognition using a binaural hearing model). Here, we develop the normalisation scheme more thoroughly, and evaluate it on a more general speech recognition task with different types of spectral distortion. Our early work on modulation mask estimation (Palomäki et al., 2002) suffered from the drawback that the algorithm needed to be hand-tuned for each reverberation condition. This problem has now been addressed by an adaptive scheme, in which the parameters of the algorithm are set according to an estimate of the degree of reverberation present in the signal. This allows the same system to be used in a wide range of reverberation conditions without the need for hand-tuning. Finally, in (Palomäki et al., 2002) the system was evaluated on a limited number of simulated room impulse responses (RIRs), whereas here we use real RIRs whose T60 reverberation times vary between 0.7 and 1.5 s. The results obtained with our new method are also compared against the recogniser of Kingsbury (1998) for reverberated speech, which uses MSG and PLP features.

Section 2 of the paper describes the overall architecture of the missing data ASR system and the acoustic features used. In Section 3, we present a processing pathway that is optimised for conditions in which speech is subject to spectral distortion and additive noise. A processing pathway for reverberant conditions is described in Section 4. The system is evaluated under a number of noise conditions in Section 5, and compared against a baseline approach. We conclude with a discussion in Section 6.

Section snippets

Speech recogniser

The missing data speech recognition system is shown schematically in Fig. 1. In this section we describe the front-end processing, which extracts spectral features using an auditory model, and explain the missing data ASR approach.

Processing for spectral distortion and additive noise

In this section we describe a processing pathway that compensates for spectral distortion and additive noise. Our approach combines three techniques: estimation of a missing data mask on the basis of the SNR in local time-frequency regions (Section 3.1), spectral subtraction (Section 3.2), and an approach to spectral feature normalisation which is suitable for missing data ASR in the presence of additive noise (Section 3.3).

Processing for reverberation

This section describes a processing pathway for missing data ASR in reverberant conditions (see Fig. 1). In the first stage, modulation filtering is used to derive a mask that identifies the speech features that are least contaminated by reverberation. Following this, spectral features are normalised using a modification of the technique described in Section 3.3.
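A crude rendering of this idea, building on the modulation_filter sketch given in the Introduction, thresholds the modulation-filtered energy to obtain a binary mask. Note that the fixed threshold theta is hypothetical: the method described in Section 4 instead sets its parameters adaptively from an estimate of the amount of reverberation.

```python
import numpy as np
# Requires modulation_filter from the earlier sketch.

def modulation_mask(ratemap, theta=0.5, **kwargs):
    """Mark as reliable the points whose modulation-filtered energy is
    high relative to the per-channel maximum.

    Crude sketch of the idea only: `theta` is a hypothetical fixed
    threshold, whereas the paper derives its parameters adaptively
    from an estimate of the degree of reverberation.
    """
    filtered = modulation_filter(ratemap, **kwargs)   # (T, F) envelopes
    strength = np.maximum(filtered, 0.0)              # keep rising energy
    ref = strength.max(axis=0, keepdims=True) + 1e-12 # per-channel max
    return (strength / ref) > theta                   # True = reliable
```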

Corpus and recogniser configuration

The missing data ASR system was evaluated using a subset of the Aurora 2.0 connected English language digits recognition task (Pearce and Hirsch, 2000). The sampling rate of all speech data was 8 kHz. Auditory rate maps were obtained for the clean training section of the Aurora corpus, and were used to train 12 word-level HMMs (see Section 2.2). The training section contained 8440 clean (noiseless) utterances. For the evaluation of the missing data system we trained two kinds of recognisers,

Discussion

In this paper we have described techniques for handling convolutional distortion in 'missing data' speech recognition, an issue which has been largely unaddressed to date. As convolutional interference can be quite different in nature depending upon the length of the impulse response concerned, we propose two approaches: one to handle spectral distortion due to a transmission line or audio equipment, and another to handle room reverberation. In summary, the results show

Acknowledgements

We thank two anonymous reviewers for their incisive comments. KJP was funded by the EC TMR SPHEAR project, the Academy of Finland (project number 1277811) and was partially supported by a Finnish Nokia säätiö grant. GJB was funded by EPSRC grant GR/R47400/01. The authors owe many thanks to Dan Ellis, Brian Kingsbury and Heidi Christensen for their kind help with implementing the MSG + PLP baseline system. Dan Ellis and Brian Kingsbury also made some of the real room impulse responses available to

References (59)

  • Atal, B.S., 1974. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am.
  • Barker, J., et al., 2000. Decoding speech in the presence of other sound sources. Proc. ICSLP-2000.
  • Barker, J., et al., 2000. Soft decisions in missing data techniques for robust automatic speech recognition. Proc. ICSLP-2000.
  • Barker, J., et al., 2001a. Robust ASR based on clean speech models: an evaluation of missing data techniques for connected digit recognition in noise. Proc. Eurospeech-2001.
  • Barker, J., Green, P.D., Cooke, M.P., 2001b. Linking auditory scene analysis and robust ASR by missing data techniques, ...
  • Bradley, J.S., 1986. Predictors of speech intelligibility in rooms. J. Acoust. Soc. Am.
  • Bregman, A.S., 1990. Auditory Scene Analysis.
  • Brown, G.J., et al., 2001. A neural oscillator sound separator for missing data speech recognition. Proc. IJCNN-2001.
  • Cole, R.A., et al., 1995. New telephone speech corpora at CSLU. Proc. Eurospeech-1995.
  • Cooke, M.P., 1993. Modelling Auditory Processing and Organization.
  • Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
  • Droppo, J., et al., 2002. Uncertainty decoding with SPLICE for noise robust speech recognition. Proc. ICASSP-2002.
  • Drullman, R., et al., 1994. Effects of temporal envelope smearing on speech reception. J. Acoust. Soc. Am.
  • Dupont, S., Ris, C., 1999. Assessing local noise level estimation methods. In: Proc. of Workshop on Robust Methods for ...
  • Eronen, A., et al., 2003. Audio context awareness: acoustic modeling and perceptual evaluation. Proc. ICASSP-2003.
  • Gold, B., et al., 2000. Speech and Audio Signal Processing.
  • Gölzer, H., Kleinschmidt, M., 2003. Importance of early and late reflections for automatic speech recognition in reverberant environments. Proc. Elektronische Sprachsignalverarbeitung (ESSV).
  • Hermansky, H., 1990. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am.
  • Hermansky, H., et al., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.