Joint variable frame rate and length analysis for speech recognition under adverse conditions

https://doi.org/10.1016/j.compeleceng.2014.09.002

Highlights

  • A computationally efficient variable frame length and rate method is proposed.

  • The method relies on a posteriori signal-to-noise ratio weighted energy distance.

  • The method improves speech recognition accuracy in noisy environments.

  • Setting a proper range of allowed frame lengths is important.

Abstract

This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise (SNR) ratio weighted energy distance and increases the length of the selected frames, according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high SNR region of a speech signal, and a lower frame rate and an increased frame length to a steady or low SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis in terms of noise-robustness.

Introduction

Speech signal analysis is generally performed over short-time frames with a fixed frame length (FFL) and a fixed frame rate (FFR), based on the assumption that speech signals, although non-stationary, exhibit quasi-stationary behavior over short durations. This fixed frame rate and length (FFRL) analysis is not optimal, since some parts of the signal (e.g. vowels) are stationary over longer durations than others (e.g. consonants and transient speech). Consequently, variable frame rate (VFR) and variable frame length (VFL) analysis methods have been proposed for speaker recognition and speech recognition [1], [2].

Variable frame rate analysis selects frames according to the signal characteristics. Speech feature vectors (frames) are first extracted at a fixed frame rate, and the decision on which frames to retain is then based on distance measures and thresholds [3], [4], [5]. In [3], the Euclidean distance between the last retained feature vector and the current vector is used as the distance measure: the current frame is discarded if the distance is smaller than a predefined threshold, with the aim of reducing the computational load.
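The distance-and-threshold selection scheme described above can be sketched as follows. This is a minimal illustration with made-up feature vectors and a hypothetical threshold value, not the exact implementation of [3]:

```python
import numpy as np

def select_frames(features, threshold):
    """Keep a frame only if its Euclidean distance to the last retained
    frame exceeds the threshold; otherwise discard it to cut the
    computational load. The threshold value used below is illustrative."""
    retained = [0]            # the first frame is always kept
    last = features[0]
    for i in range(1, len(features)):
        if np.linalg.norm(features[i] - last) >= threshold:
            retained.append(i)
            last = features[i]
    return retained

# Five toy 2-D feature vectors: near-duplicates of the last retained frame are dropped
feats = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [2.0, 2.0]])
print(select_frames(feats, 0.5))  # -> [0, 2, 4]
```

Frames 1 and 3 are discarded because they lie within the threshold distance of the previously retained frame.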

Recent research in VFR analysis has moved towards finding an optimal representation of a speech signal to improve performance in noisy environments. This requires frame analysis in steps smaller than the standard 10 ms, while the average frame rate remains largely unchanged. In [4], an effective VFR method was proposed that uses a 25 ms frame length with a 2.5 ms frame shift for calculating Mel-frequency cepstral coefficients (MFCCs) and conducts frame selection based on an energy weighted cepstral distance. The method significantly improves the recognition accuracy in noisy environments at the cost of degraded performance for clean speech. In [5], an entropy measure is used instead of a cepstral distance, improving recognition performance at the price of higher complexity. To provide a fine resolution for rapidly changing events, these methods examine speech signals at much shorter intervals (i.e. 2.5 ms) than the normal frame shift of 10 ms. The algorithms extract features such as MFCCs and entropy at a high frame rate for frame selection, which is computationally expensive. An effective energy based frame selection method was proposed in [6]; it uses delta logarithmic energy as the criterion for determining the size of the frame shift, on the basis of a sample-by-sample search. Evidently, an energy based search is more computationally efficient. Speech segments matter in speech recognition not only through their characteristics (measured by MFCCs, energy and so on), but also through their reliability. Therefore, a low-complexity VFR method based on the a posteriori signal-to-noise ratio (SNR) weighted energy distance was proposed in [2].
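The idea of SNR-weighted energy-distance frame selection can be sketched roughly as follows. The accumulate-and-reset logic and the threshold value are illustrative assumptions for exposition, not the exact formulation of [2]:

```python
import numpy as np

def snr_weighted_selection(log_energy, snr_post, threshold):
    """Accumulate the a posteriori SNR weighted log-energy distance
    between consecutive frames; select a frame whenever the accumulator
    crosses the threshold, then reset. High-SNR, fast-changing regions
    therefore receive more frames than steady or low-SNR regions."""
    selected = []
    acc = 0.0
    for t in range(1, len(log_energy)):
        acc += abs(log_energy[t] - log_energy[t - 1]) * snr_post[t]
        if acc >= threshold:
            selected.append(t)
            acc = 0.0
    return selected

# Energy changes at frames 1 and 4 attract the selected frames
log_e = np.array([0.0, 1.0, 1.0, 1.0, 3.0])
snr = np.ones(5)  # uniform a posteriori SNR for simplicity
print(snr_weighted_selection(log_e, snr, 1.0))  # -> [1, 4]
```

With a uniform SNR weight the scheme reduces to plain energy-distance selection; in noisy speech the SNR weight suppresses selections triggered by noise-only fluctuations.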

While VFR analysis has been used for improving the noise-robustness of speech recognition, a primary challenge in the field, VFL analysis has, to the best of our knowledge, rarely been exploited for this problem. One exception is a pseudo pitch synchronous analysis method that uses a variable frame size and/or frame offset to align frames to natural pitch cycles [7]. Three pitch synchronization methods are presented: depitch, syncpitch and padpitch. On the Aurora 2 database, using multi-condition training, all three methods perform worse than the baseline (without pitch synchronization processing) for the clean, 20 dB, 15 dB and 10 dB conditions. Depitch is worse than the baseline in all conditions, syncpitch outperforms the baseline only at −5 dB, and padpitch performs better at −5 dB and 0 dB and equally at 5 dB.

For general speech recognition, rather than focusing on noise-robustness, a speaking rate normalization technique that adjusts both the frame rate and frame size (i.e. VFRL) has been implemented on a state-of-the-art speech recognition architecture and evaluated on the GALE broadcast transcription tasks [8]. By warping the step size and the window size in the front-end according to the speaking rate, the technique shows consistent improvement on all systems and gives the lowest decoding error rates on the corresponding test sets. Instead of using fixed-length frames, a segment-based recognizer represents the observation space as a graph, in which each arc corresponds to a hypothesized variable-length segment [9].

The a posteriori SNR weighted energy distance based VFR method proposed in [2] has been shown to assign more frames to fast changing events and fewer frames to steady or low SNR regions, even for very low SNR signals, thus significantly improving noise-robustness. The method can be combined with VFL analysis through a natural way of determining frame length: extend the frame length when fewer frames are selected. Specifically, the lengths of the selected frames are extended when their preceding frames are not selected; the motivation and details are presented in Section 2. As a result, the frame length is kept at its normal value in fast changing regions, whereas it is increased in steady or low SNR regions. The proposed VFRL method is applied to speech recognition in noisy environments.
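The length-determination rule described above can be sketched as follows. The base length, the per-gap increment and the cap (all in ms) are hypothetical parameters chosen for illustration; the paper's actual values are given in Section 2:

```python
def assign_frame_lengths(num_frames, selected, base_len=25.0, step=5.0, max_len=40.0):
    """For each selected frame, extend its length in proportion to the
    number of immediately preceding non-selected frames, capped at
    max_len. Fast changing regions thus keep the normal length, while
    steady regions get fewer, longer frames."""
    chosen = set(selected)
    lengths = {}
    gap = 0  # consecutive non-selected frames seen since the last selection
    for t in range(num_frames):
        if t in chosen:
            lengths[t] = min(base_len + gap * step, max_len)
            gap = 0
        else:
            gap += 1
    return lengths

# Frames 2 and 3 are not selected, so frame 4 gets a longer analysis window
print(assign_frame_lengths(6, [0, 1, 4]))  # -> {0: 25.0, 1: 25.0, 4: 35.0}
```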

As the VFRL method operates in the time domain, in the sense that it decides which frames to retain, it has good potential to be combined with other robustness methods, which in general operate in the feature or model domain to reduce the mismatch between the training and test speech signals. Feature based methods include feature enhancement, distribution normalization and noise robust feature extraction. Feature enhancement attempts to remove the noise from the signal, as in spectral subtraction (SS) [10], non-local means de-noising [11] and vector Taylor series (VTS) [12]. Distribution normalization reduces the distribution mismatch between training and test speech, for example through cepstral mean and variance normalization (CMVN) [13]. Noise robust features include improved MFCCs [14] and the newly proposed power-normalized cepstral coefficients [15]. On the acoustic modelling side, deep neural networks [16] have recently attracted a significant amount of attention in the field of noise robust speech recognition. In this work, the VFRL analysis is combined with minimum statistics noise estimation based SS [10], [17].
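The spectral subtraction component can be sketched as follows. This is a generic textbook magnitude-domain variant with a spectral floor, not the minimum statistics noise estimation based SS of [10], [17]; the over-subtraction factor and floor value are illustrative assumptions:

```python
import numpy as np

def spectral_subtraction(mag, noise_mag, alpha=1.0, beta=0.02):
    """Subtract the estimated noise magnitude spectrum from the noisy
    magnitude spectrum and clamp the result to a spectral floor of
    beta * noise_mag, so negative magnitudes (and extreme musical
    noise) are avoided. alpha controls over-subtraction."""
    subtracted = mag - alpha * noise_mag
    return np.maximum(subtracted, beta * noise_mag)

# Two toy frequency bins: the second would go negative without the floor
noisy = np.array([1.0, 0.5])
noise = np.array([0.4, 0.6])
print(spectral_subtraction(noisy, noise))  # -> [0.6, 0.012]
```

The enhanced magnitude is then recombined with the noisy phase before inverting the transform; a minimum statistics noise tracker would replace the fixed `noise` estimate here.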

The remainder of this paper is organized as follows: Section 2 presents the proposed variable frame rate and length algorithm. The experimental results and discussions are given in Section 3. Section 4 investigates the effect of frame length on speech recognition performance. Finally, Section 5 concludes this work.

Section snippets

Variable frame rate and length algorithm

This section presents an a posteriori SNR weighted energy distance based VFRL method and shows illustrative results of frame selection and length determination.

Speech recognition experiments and discussions

This section evaluates the proposed method through a number of experiments. Furthermore, we combine it with a spectral-domain method and present experimental results.

Analysis of the effect of frame length

An interesting and important question is how different frame lengths affect the performance of speech recognition. In general, the frame length for speech analysis is chosen to be short enough that the speech properties of interest remain roughly unchanged within the frame, yet long enough to allow estimation of the desired parameters [24]. In [25], it is shown that longer speech segments can be recognized more accurately from noise compared to shorter ones in the context of

Conclusions

This paper has shown that the proposed variable frame length and rate method, using an accumulative a posteriori SNR weighted energy distance, is able to assign more frames with normal lengths to fast changing events and fewer frames with larger frame lengths to steady regions. The variable frame rate analysis aims to find the right time resolution at the signal level, while the variable frame length analysis aims at the right time–frequency resolution at the frame level for noisy speech.

Zheng-Hua Tan is an associate professor at Aalborg University, Denmark. His research interests include speech processing, multimodal sensing, human–robot interaction, and machine learning. He has a PhD in electronic engineering from Shanghai Jiao Tong University, China. He was a Visiting Scientist at MIT, USA, an Associate Professor at Shanghai Jiao Tong University, and a postdoctoral fellow at KAIST, Korea.

References (25)

  • J.R. Glass

    A probabilistic framework for segment-based speech recognition

    Comput Speech Lang

    (2003)
  • O. Viikki et al.

    Cepstral domain segmental feature vector normalization for noise robust speech recognition

    Speech Commun

    (1998)
  • Jung C-S, Han KJ, Seo H, Narayanan SS, Kang HG. A variable frame length and rate algorithm based on the spectral...
  • Z.-H. Tan et al.

    Low-complexity variable frame rate analysis for speech recognition and voice activity detection

    IEEE J Sel Top Signal Proc

    (2010)
  • K.M. Ponting et al.

    The use of variable frame rate analysis in speech recognition

    Comput Speech Lang

    (1991)
  • Zhu Q, Alwan A. On the use of variable frame rate analysis in speech recognition. In: Proceedings of ICASSP-2000,...
  • You H, Zhu Q, Alwan A. Entropy-based variable frame rate analysis of speech signals and its application to ASR. In:...
  • Epps J, Choi E. An energy search approach to variable frame rate front-end processing for robust ASR. In: Proceedings...
  • R.D. Zilca et al.

    Pseudo pitch synchronous analysis of speech with applications to speaker recognition

    IEEE Trans Audio Speech Lang Process

    (2006)
  • Chu S, Povey D. Speaking rate adaptation using continuous frame rate normalization. In: Proceedings of ICASSP 2010,...
  • Martin R. Spectral Subtraction based on Minimum Statistics. In: Proceedings of EUSIPCO, Edinburgh, Scotland, UK;...
  • H. Xu et al.

    Robust speech recognition by non-local means de-noising processing

    IEEE Signal Process Lett

    (2008)


    Ivan Kraljevski obtained his PhD degree at the Faculty of Electrical Engineering and Information Technology, University “St. Cyril and Methodius”, Skopje, Macedonia. His scientific and professional interests include: Speech and Audio Signal Processing, Speech Recognition, Speech Synthesis, Speaker Identification, Noise Robust Speech Recognition, Pattern Recognition and Artificial Neural Networks. Current position is Speech Communication Engineer at VoiceINTERConnect GmbH, Dresden, Germany.

    Reviews processed and approved for publication by the Editor-in-Chief.
