Autoregressive modeling of speech trajectory transformed to the reconstructed phase space for ASR purposes
Introduction
The human speech production system is inherently a multivariate dynamic system whose output is normally recorded as a one-dimensional (scalar) signal, designated as a discrete-time speech signal. It has been demonstrated in [1], [2], [3] that the speech signal exhibits chaotic behavior during its production, owing to factors such as the nonlinear vibration of the vocal folds and the turbulent movement of airflow in the vocal tract. The one-dimensional speech signal can therefore be treated as a time series and given a multivariate representation in the reconstructed phase space (RPS), also called the state space (SS) [4]. This mapping can be constructed so that the trajectory of the signal in the phase space conforms, geometrically and topologically, to the actual trajectory of the system that generated it [4]. The embedding, or RPS, methods, among the most notable techniques in the study of nonlinear and chaotic dynamics, were introduced by Takens [5]. On this basis, even for a multidimensional dynamic system, a record of a single-variable time series is often sufficient to recover the full dynamics of the system [4]. Consequently, the manifold obtained by mapping the speech signal into the RPS can be expected to yield valuable information about the signal.
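The Takens-style mapping described above can be sketched as a time-delay embedding, where each RPS point stacks delayed copies of the scalar signal. The embedding dimension and delay below are illustrative placeholders; in practice they are chosen by criteria such as false nearest neighbors and mutual information, which the text does not specify here.

```python
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Map a 1-D signal x into the reconstructed phase space (RPS) via
    time-delay embedding: X[n] = (x[n], x[n+tau], ..., x[n+(dim-1)*tau])."""
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("signal too short for the chosen dim and tau")
    # each column is a delayed copy of the signal
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# toy example: a pure sine embeds as a closed curve (ellipse) in the RPS
t = np.arange(400)
x = np.sin(2 * np.pi * t / 50)
X = delay_embed(x, dim=3, tau=5)
print(X.shape)  # (390, 3)
```

Each row of `X` is one point of the trajectory; the geometry of this point cloud is what the RPS-based methods below try to characterize.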
Recently, several studies have been reported in the field of speech processing based on embedding the speech signal in the RPS [6], [7]. The basic idea behind this line of research is that extracting the information associated with the trajectory of the speech signal can yield discriminant information that is ignored by traditional speech analysis methods, such as spectral-based feature extraction algorithms [8], [9].
Speech signal processing methods based upon the RPS have been developed for various applications. For example, in [10] an algorithm for pitch mark determination of the speech signal is proposed, based on dynamical systems theory. Although a few studies have attempted to exploit signal embedding in continuous speech recognition (CSR) systems, the techniques proposed so far have not produced considerable improvements in this task. Most work in this field uses chaotic features (such as the correlation dimension or fractal dimension) and their extensions, or, as a special case, models the distributions of embedded signals in the RPS. For instance, Pitsikalis in [11], [12] attempts to improve the performance of speech recognition systems by adding chaotic features, such as the “correlation sum”, “generalized fractal dimension”, and “Lyapunov exponents”, to the traditional features used in typical phoneme recognition systems. In [13], the Lyapunov exponent is applied as a characterization of the speech signal, and its dynamics are approximated by nonlinear models such as global or local polynomials, radial basis function networks, fuzzy-logic systems and support vector machines. Also, in [14], two RPS-based features extracted from speech signals are introduced. These features are scalar values representing the scalar displacement (Ds) and radial displacement (Dr) of a signal trajectory in the phase space; in fricative phonemes, for example, Ds and Dr assume higher values than in vowels. In [15], a time-domain, RPS-based approach is proposed to model and classify consonant-vowel (CV) speech unit waveforms. Using the state space point distribution (SSPD) of the embedded speech signal in the RPS, a feature extraction method is proposed and evaluated on Malayalam utterances.
In 2002, for the first time, Ye et al. employed statistical distributions of embedded speech signals in the RPS, taking advantage of the histogram method, in order to classify isolated phonemes [16]. Subsequently, they exploited principal component analysis (PCA) in order to orthogonalize and reduce the dimension of the signals embedded in the RPS [17]. Liu introduced a method for classification of vowels based on a distance measure between the attractors of phonemes in the RPS [18]. In [19], [20], [21], [22], [23], some isolated phoneme speech recognition systems that model the speech trajectory in the RPS with phoneme-specific Gaussian mixture models (GMMs) and perform maximum likelihood classification were tested on TIMIT. Moreover, Indrebo proposed a method for decomposition of a speech signal to its sub-bands, through combination of RPS and filter banks [24]. Exploitation of a sub-band decomposition of the signal could lead to more robust recognition results [25]. Jafari et al. in [26] proposed a feature vector whose elements are a combination of the popular Mel-frequency cepstral coefficients (MFCC) and some other features extracted from the RPS. In this approach, RPS-based features are attained through parametric modeling of the “Poincare section” of the signal embedded in the RPS. This modeling is performed using GMM. The mixture parameters, i.e. the means, variances and component weights, are exploited as the extracted features.
In recent research, RPS-based features have typically been employed in isolated phoneme recognition tasks. In a very limited number of cases, these features have been utilized for CSR, as a set of additional features appended to the traditional speech features. This indicates that RPS-based feature extraction methods still need further development and that more studies are required in this field. In this paper, we move in this direction and propose an extension of the linear prediction (LP)-based feature extraction algorithm, in order to extract useful information from speech signals embedded in the RPS, which may prove worthwhile in speech recognition applications. The basis of the proposed method is a parametric modeling of the speech trajectory in the RPS using multivariate autoregressive (MVAR) modeling. The obtained MVAR coefficients are a crude set of parameters that must be post-processed before use in automatic speech recognition (ASR) systems.
MVAR modeling is a development of the LP-analysis method, a popular time-domain technique in many fields of speech processing, including the coding, synthesis and recognition of speech signals [27]. Through this method, linear modeling of the one-dimensional speech signal is realized. Extensions of this method based on the reflection and cepstrum coefficients have been suggested, which can improve speech recognition results [28]. In the 1990s, after the advent of feature extraction methods such as MFCCs, which are based on filter banks and the cepstrum and have been shown to be superior to time-domain features in many speech recognition tasks, the application of LPC-based features in this field became quite limited.
The MVAR algorithm has been used to model multidimensional signals whose information is recorded simultaneously through several channels (such as ECG and EEG) [29], [30], [31], [32], [33]. In such applications, multi-channel recording leads to a multidimensional signal. The application of the MVAR method in speech recognition, however, has not been considered, presumably because the speech signal is primarily a one-dimensional signal. Yet by transferring it to the RPS, a multidimensional signal is formed, which contains the attractor characteristics of the speech signal.
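The MVAR fit itself can be sketched with an ordinary least-squares solution of the linear prediction equations over the multidimensional trajectory. This is only a minimal illustration of the idea; the model order and the solver are assumptions here, and the literature also uses multichannel Levinson-type recursions for the same purpose.

```python
import numpy as np

def fit_mvar(X, p=1):
    """Least-squares fit of a p-th-order multivariate AR (MVAR) model
    X[n] = A_1 X[n-1] + ... + A_p X[n-p] + e[n]
    to a trajectory X of shape (N, d). Returns A as a (p, d, d) array."""
    N, d = X.shape
    # each row of Z stacks the p most recent past points of the trajectory
    Z = np.hstack([X[p - k : N - k] for k in range(1, p + 1)])  # (N-p, p*d)
    Y = X[p:]                                                    # (N-p, d)
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)                    # (p*d, d)
    return B.T.reshape(d, p, d).transpose(1, 0, 2)

# sanity check: recover the matrix of a known noise-driven 2-D AR(1) process
rng = np.random.default_rng(0)
A_true = np.array([[0.6, -0.2], [0.3, 0.5]])
X = np.zeros((500, 2))
for n in range(1, 500):
    X[n] = A_true @ X[n - 1] + 0.01 * rng.standard_normal(2)
A_est = fit_mvar(X, p=1)[0]
print(np.max(np.abs(A_est - A_true)))  # small estimation error
```

Applied per frame to the RPS trajectory, the entries of the estimated coefficient matrices play the role that scalar LP coefficients play in classical one-dimensional analysis.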
The remainder of this paper is organized as follows: in Section 2, a review of the general structure of an automatic speech recognition system is provided; in Section 3, the details associated with the feature extraction methods being applied in this work are discussed; Section 4 presents the experimental results with all the details and conditions; finally, Sections 5 and 6 provide the discussions and the conclusions, respectively.
The speech recognition framework
An ASR system is comprised of several main blocks. The first stage of the system, the feature extraction block, converts the speech signal into a sequence of feature vectors. The extracted features must carry sufficient information from the input signal to discriminate between various speech patterns (e.g. different phonemes). The second block searches for the best sequence of phonemes or words to be recognized. For this purpose, an acoustic model, usually together with a language model, is employed.
Feature extraction from the speech signal
The input speech signal of a recognition system is a discrete signal with a constant sampling rate. The samples must be transformed into a sequence of vectors, so-called feature or observation vectors. Feature vectors must include the information needed to discriminate between different acoustic units. Generally, the speech signal is segmented into overlapping windows to produce frames of speech; the typical window length is 25 milliseconds and the frame shift is 10 milliseconds. Feature extraction is then performed on each frame.
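The framing step described above can be sketched as follows. The 16 kHz sampling rate and the rectangular (no-taper) windowing are illustrative assumptions; practical front ends typically apply a tapered window such as a Hamming window to each frame.

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=25, shift_ms=10):
    """Segment a signal into overlapping frames using the typical
    25 ms window length and 10 ms frame shift mentioned in the text."""
    win = int(fs * win_ms / 1000)      # samples per frame (400 at 16 kHz)
    shift = int(fs * shift_ms / 1000)  # samples per shift (160 at 16 kHz)
    n_frames = 1 + max(0, (len(x) - win) // shift)
    return np.stack([x[i * shift : i * shift + win] for i in range(n_frames)])

x = np.zeros(16000)        # one second of audio at 16 kHz
frames = frame_signal(x)
print(frames.shape)        # (98, 400)
```

Each row is one frame, from which a feature vector is subsequently extracted.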
Experiments
This section details the databases used and reports experiments on the proposed method alongside several conventional speech feature extraction approaches; results are given for two applications, continuous ASR and isolated phoneme recognition.
Discussion
The results obtained through implementation of the aforementioned methods show that the simultaneous application of signal embedding theory (through the RPS) and multidimensional modeling (by the MVAR method) provides more useful and discriminant information than applying a similar modeling approach to the original one-dimensional speech signal. Moreover, in this research, we showed that the proposed method is a practical RPS-based approach which can be directly employed in continuous speech recognition systems.
Conclusion
In this paper, a new speech feature extraction method is introduced, taking advantage of the reconstructed phase space (RPS) properties, the multivariate linear prediction model (the MVAR approach) and the LDA dimension reduction method. In the proposed method, unlike typical feature extraction approaches for continuous speech recognition systems, where the feature vector is extracted from the one-dimensional speech signal, the required features (MVLPREF) are obtained through calculation of MVAR coefficients over the multidimensional trajectory of the speech signal embedded in the RPS.
Yasser Shekofteh received his BS in biomedical engineering and electrical engineering from Amirkabir University of Technology, Tehran, Iran, in 2005 and 2006, respectively. He received his MS in biomedical engineering from Amirkabir University of Technology in 2008. He is currently a PhD candidate in the Biomedical Engineering Department at Amirkabir University of Technology. His research is mainly focused on signal processing, speech recognition, and keyword spotting.
References (59)
- et al., Chaos in voice, from modeling to measurement, J. Voice (2006)
- et al., Poincare pitch marks, Speech Commun. (2006)
- et al., Sub-banded reconstructed phase spaces for speech recognition, Speech Commun. (2006)
- et al., Joint order and parameter estimation of multivariate autoregressive models using multi-model partitioning theory, Digital Signal Process. (2006)
- Autoregressive spectral estimation in noise with reference to speech analysis, Digital Signal Process. (1991)
- et al., Hybrid statistical pronunciation models designed to be trained by a medium-size corpus, Comput. Speech Lang. (2009)
- et al., The integration of principal component analysis and cepstral mean subtraction in parallel model combination for robust speech recognition, Digital Signal Process. (2011)
- et al., A new representation for speech frame recognition based on redundant wavelet filter banks, Speech Commun. (2012)
- et al., Speech database development at MIT, TIMIT, and beyond, Speech Commun. (1990)
- et al., Is speech chaotic? Invariant geometrical measures for speech data
- Speech characterization and synthesis by nonlinear methods, IEEE Trans. Audio Speech Lang. Process.
- Nonlinear Time Series Analysis
- Detecting strange attractors in turbulence
- Nonlinear processing of speech
- Some advances in nonlinear speech modeling using modulations, fractals, and chaos
- Comparison of parametric representations for monosyllable word recognition in continuously spoken sentences, IEEE Trans. Audio Speech Lang. Process.
- Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Amer.
- Speech analysis and feature extraction using chaotic models
- Nonlinear analysis of speech signals: generalized dimensions and Lyapunov exponents
- Nonlinear speech analysis using models for chaotic systems, IEEE Trans. Audio Speech Lang. Process.
- A new time-domain feature parameter for phoneme classification
- Time-domain non-linear feature parameter for consonant classification, Int. J. Speech Technol.
- Phoneme classification using naive Bayes classifier in reconstructed phase space
- Phoneme classification over the reconstructed phase space using principal component analysis
- Vowel classification by global dynamic modeling
- Speech recognition using reconstructed phase space features
- Joint frequency domain and reconstructed phase space features for speech recognition
- Time series classification using Gaussian mixture models of reconstructed phase spaces, IEEE Trans. Knowl. Data Eng.
- Time-domain isolated phoneme classification using reconstructed phase spaces, IEEE Trans. Speech Audio Process.
Farshad Almasganj received his MS in electrical engineering from Amirkabir University of Technology, Tehran, Iran, in 1987 and his PhD in biomedical engineering from Tarbiat Modares University, Tehran, Iran, in 1998. He is currently an associate professor in the Biomedical Engineering Department of Amirkabir University of Technology. His research interests include automatic detection of voice disorders, speech recognition, prosody, and language modeling for ASR systems.