Digital Signal Processing

Volume 23, Issue 6, December 2013, Pages 1923-1932

Autoregressive modeling of speech trajectory transformed to the reconstructed phase space for ASR purposes

https://doi.org/10.1016/j.dsp.2013.06.011

Abstract

Investigating new, effective feature extraction methods for the speech signal is an important route to improving the performance of automatic speech recognition (ASR) systems. Since the reconstructed phase space (RPS) is a suitable domain for capturing the true dynamics of a signal, in this paper we propose a new method for feature extraction from the trajectory of the speech signal in the RPS. The method is based on modeling the speech trajectory with the multivariate autoregressive (MVAR) method. We then apply linear discriminant analysis (LDA) to simultaneously decorrelate the final feature set and reduce its dimension. Experimental results show that an MVAR model of order 6 is appropriate for modeling the trajectory of speech signals in the RPS. Recognition experiments are conducted with an HMM-based continuous speech recognition system and a naive Bayes isolated phoneme classifier on the Persian FARSDAT and American English TIMIT corpora to compare the proposed features against earlier RPS-based features and traditional spectral-based MFCC features.

Introduction

The human speech production system is inherently a multivariate dynamic system whose output is normally recorded as a one-dimensional (scalar) signal, the discrete-time speech signal. In [1], [2], [3], it is demonstrated that the speech signal exhibits some chaotic behavior during its production, due to factors such as nonlinear vibration of the vocal folds and turbulent airflow in the vocal tract. The one-dimensional speech signal can therefore be treated as a time series and given a multivariate representation in the reconstructed phase space (RPS), also called the state space (SS) [4]. This mapping can be constructed so that the trajectory attributed to the signal in the phase space conforms, geometrically and topologically, to the actual trajectory of the system generating it [4]. The embedding, or RPS, methods, among the most notable techniques in the study of nonlinear and chaotic dynamics, were introduced by Takens [5]. On the basis of this idea, even for a multidimensional dynamic system, a record of a single-variable time series is often sufficient to recover the full dynamics of the system [4]. Consequently, by considering the manifold obtained through mapping of the speech signal into the RPS, one can expect to extract valuable information from the speech signal.
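For illustration, the delay-embedding step can be sketched in a few lines of Python (a minimal sketch, not taken from the paper; the embedding dimension, delay, and test signal below are placeholder values rather than the settings used in the experiments):

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Map a scalar time series x into a `dim`-dimensional reconstructed
    phase space with delay `tau` (Takens' time-delay embedding).
    Row n of the result is [x[n], x[n+tau], ..., x[n+(dim-1)*tau]]."""
    n_points = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n_points] for i in range(dim)], axis=1)

# Placeholder signal: a noisy sinusoid standing in for a short speech frame.
t = np.arange(400)
x = np.sin(2 * np.pi * 0.03 * t) + 0.1 * np.random.randn(400)
trajectory = delay_embed(x, dim=3, tau=5)   # shape (390, 3)
```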

Recently, some research has been reported in the field of speech processing based on embedding the speech signal in the RPS [6], [7]. The basic idea behind this line of research is that extracting the information associated with the trajectory of the speech signal can yield discriminant information that is ignored by traditional speech analysis methods such as spectral-based feature extraction algorithms [8], [9].

Speech signal processing methods based upon the RPS have been developed for various applications. For example, in [10] an algorithm for pitch mark determination of the speech signal is proposed, based on dynamical systems theory. Only limited studies have attempted to exploit signal embedding in continuous speech recognition (CSR) systems, and the techniques proposed so far have not yet produced considerable improvements in this task. Most of the work in this field uses chaotic features (such as the correlation dimension or fractal dimension) and their extensions, or, as a special case, models of the distributions of embedded signals in the RPS. For instance, Pitsikalis in [11], [12] attempts to improve the performance of speech recognition systems by adding chaotic features, such as the "correlation sum", "generalized fractal dimension", and "Lyapunov exponents", to the traditional features used in typical phoneme recognition systems. In [13], the Lyapunov exponent is applied as a characterization of the speech signal, and its dynamics are approximated by nonlinear models such as global or local polynomials, radial basis function networks, fuzzy-logic systems, and support vector machines. In [14], two RPS-based features extracted from speech signals are introduced: scalar values representing the scalar displacement (Ds) and radial displacement (Dr) of a signal trajectory in the phase space. In fricative phonemes, for example, Ds and Dr assume higher values than in vowels. In [15], a time-domain, RPS-based approach is proposed to model and classify consonant-vowel (CV) speech unit waveforms; using the state space point distribution (SSPD) of the embedded speech signal in the RPS, a feature extraction method is proposed and evaluated on some Malayalam utterances.

In 2002, Ye et al. were the first to employ statistical distributions of embedded speech signals in the RPS, via the histogram method, to classify isolated phonemes [16]. Subsequently, they exploited principal component analysis (PCA) to orthogonalize and reduce the dimension of the signals embedded in the RPS [17]. Liu introduced a method for vowel classification based on a distance measure between the attractors of phonemes in the RPS [18]. In [19], [20], [21], [22], [23], isolated phoneme recognition systems that model the speech trajectory in the RPS with phoneme-specific Gaussian mixture models (GMMs) and perform maximum likelihood classification were tested on TIMIT. Moreover, Indrebo proposed a method for decomposing a speech signal into its sub-bands by combining the RPS with filter banks [24]; exploiting such a sub-band decomposition can lead to more robust recognition results [25]. Jafari et al. [26] proposed a feature vector whose elements combine the popular Mel-frequency cepstral coefficients (MFCC) with features extracted from the RPS. In their approach, the RPS-based features are obtained through parametric modeling of the Poincaré section of the embedded signal using a GMM; the mixture parameters, i.e. the means, variances, and component weights, serve as the extracted features.

In recent research, RPS-based features have typically been employed in isolated phoneme recognition tasks; in a very limited number of cases they have been used for CSR, as additional features appended to the traditional speech features. This shows that RPS-based feature extraction methods still need further development, and more studies are required in this field. In this paper, we move in this direction and propose an extension of the linear prediction (LP)-based feature extraction algorithm, designed to capture information from speech signals embedded in the RPS that could be worthwhile for speech recognition applications. The basis of the proposed method is a parametric modeling of the speech trajectory in the RPS using multivariate autoregressive (MVAR) modeling. The resulting MVAR coefficients are a crude set of parameters that must be post-processed for use in automatic speech recognition (ASR) systems.

MVAR modeling is a generalization of the LP-analysis method, a popular time-domain technique in many fields of speech processing, including coding, synthesis, and recognition [27]. LP analysis provides a linear model of the one-dimensional speech signal. Extensions based on reflection and cepstrum coefficients have been suggested, which can improve speech recognition results [28]. In the 1990s, after the advent of feature extraction methods such as MFCCs, which are based on filter banks and the cepstrum and have been shown to be superior to time-domain features in many speech recognition tasks, the application of LPC-based features in this field became quite limited.

The MVAR algorithm has been used to model multidimensional signals whose information is recorded simultaneously through several channels (such as ECG and EEG) [29], [30], [31], [32], [33]; in such applications, multi-channel recording yields a multidimensional signal. The application of the MVAR method to speech recognition, however, has not been considered, probably because the speech signal is primarily one-dimensional. By transferring it to the RPS, a multidimensional signal is formed that captures the attractor characteristics of the speech signal.
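As a rough sketch of this modeling step, an order-p MVAR model can be fitted to an embedded trajectory by ordinary least squares, as below. The estimator is an illustrative assumption: the paper derives reflection-based features (MVLPREF) from the model rather than using raw least-squares coefficients directly, and its estimation algorithm may differ.

```python
import numpy as np

def fit_mvar(Y, order):
    """Least-squares fit of Y[n] = A_1 Y[n-1] + ... + A_p Y[n-p] + e[n]
    for a trajectory Y of shape (N, d). Returns (order, d, d) matrices A_k."""
    N, d = Y.shape
    # Regressors: lagged samples [Y[n-1], ..., Y[n-p]] stacked column-wise.
    X = np.hstack([Y[order - k : N - k] for k in range(1, order + 1)])  # (N-p, p*d)
    T = Y[order:]                                                       # (N-p, d) targets
    coeffs, *_ = np.linalg.lstsq(X, T, rcond=None)                      # (p*d, d)
    return coeffs.T.reshape(d, order, d).transpose(1, 0, 2)

# Order 6 is the value the paper reports as appropriate for speech trajectories.
# `trajectory` is a placeholder (N, d) RPS trajectory, e.g. from a delay embedding.
trajectory = np.random.randn(400, 3)
A = fit_mvar(trajectory, order=6)       # (6, 3, 3) coefficient matrices
raw_features = A.reshape(-1)            # crude coefficients prior to post-processing
```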

The remainder of this paper is organized as follows: in Section 2, a review of the general structure of an automatic speech recognition system is provided; in Section 3, the details associated with the feature extraction methods being applied in this work are discussed; Section 4 presents the experimental results with all the details and conditions; finally, Sections 5 and 6 provide the discussions and the conclusions, respectively.

Section snippets

The speech recognition framework

An ASR system comprises several main blocks. The first stage, the feature extraction block, converts the speech signal into a sequence of feature vectors. The extracted features must carry sufficient information from the input signal to discriminate between various speech patterns (e.g. different phonemes). The second block searches for the best sequence of phonemes or words to be recognized. For this purpose, an acoustic model trained on the extracted features is used together with a language model to score candidate hypotheses.

Feature extraction from the speech signal

The input speech signal of a recognition system is a discrete signal with a constant sampling rate. The samples must be transformed into a sequence of so-called feature or observation vectors, which must include the information needed to discriminate between different acoustic units. Generally, the speech signal is segmented into overlapping windows to produce frames of speech; a typical window length is 25 milliseconds with a frame shift of 10 milliseconds. Extraction of feature vectors is then carried out on each of these frames.
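A minimal sketch of this framing convention (window function and edge handling omitted for brevity; the signal and sampling rate below are placeholders):

```python
import numpy as np

def frame_signal(x, fs, win_ms=25.0, shift_ms=10.0):
    """Split x (sampling rate fs, in Hz) into overlapping frames of win_ms
    milliseconds, advanced by shift_ms milliseconds; trailing samples that
    do not fill a complete frame are dropped."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // shift)
    return np.stack([x[i * shift : i * shift + win] for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000), fs=16000)  # 1 s of audio -> (98, 400)
```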

Experiments

This section details the databases used and reports various experiments on the proposed method and some conventional speech feature extraction approaches; results are obtained for two applications, continuous ASR and isolated phoneme recognition.

Discussion

The results obtained from the aforementioned methods show that the simultaneous application of signal embedding theory (through the RPS) and multidimensional modeling (by the MVAR method) provides more useful and discriminant information than applying a similar modeling approach to the original one-dimensional speech signal. Moreover, in this research, we showed that the proposed method is a practical RPS-based approach which could be directly employed in continuous speech recognition systems.

Conclusion

In this paper, a new speech feature extraction method is introduced that takes advantage of reconstructed phase space (RPS) properties, a multivariate linear prediction model (the MVAR approach), and the LDA dimension reduction method. In the proposed method, unlike typical feature extraction approaches for continuous speech recognition systems, where the feature vector is extracted from the one-dimensional speech signal, the required features (MVLPREF) are obtained through calculation of the coefficients of an MVAR model fitted to the multidimensional trajectory of the speech signal embedded in the RPS.
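As an illustrative sketch of the LDA stage (with placeholder data and an assumed target dimension, not the paper's actual configuration), the crude MVAR features can be simultaneously decorrelated and reduced as follows:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder data: one raw MVAR-derived vector per frame, labeled by phoneme class.
X = np.random.randn(500, 54)           # e.g. flattened order-6 MVAR coefficients
y = np.random.randint(0, 30, 500)      # 30 phoneme classes (illustrative)

# LDA projects onto decorrelated directions that maximize between-class
# separation; at most (n_classes - 1) components are available.
lda = LinearDiscriminantAnalysis(n_components=12)
X_reduced = lda.fit_transform(X, y)    # shape (500, 12)
```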

References (59)

  • M. Banbrook et al., Speech characterization and synthesis by nonlinear methods, IEEE Trans. Audio Speech Lang. Process. (1999)
  • H. Kantz et al., Nonlinear Time Series Analysis (1997)
  • F. Takens, Detecting strange attractors in turbulence
  • G. Kubin, Nonlinear processing of speech
  • P. Maragos et al., Some advances in nonlinear speech modeling using modulations, fractals, and chaos
  • S.B. Davis et al., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Audio Speech Lang. Process. (1980)
  • H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Amer. (1990)
  • V. Pitsikalis et al., Speech analysis and feature extraction using chaotic models
  • V. Pitsikalis et al., Nonlinear analysis of speech signals: Generalized dimensions and Lyapunov exponents
  • I. Kokkinos et al., Nonlinear speech analysis using models for chaotic systems, IEEE Trans. Audio Speech Lang. Process. (2005)
  • S. Yu et al., A new time-domain feature parameter for phoneme classification
  • T.M. Thasleema et al., Time-domain non-linear feature parameter for consonant classification, Int. J. Speech Technol. (2012)
  • J. Ye et al., Phoneme classification using naive Bayes classifier in reconstructed phase space
  • J. Ye et al., Phoneme classification over the reconstructed phase space using principal component analysis
  • X. Liu et al., Vowel classification by global dynamic modeling
  • A.C. Lindgren et al., Speech recognition using reconstructed phase space features
  • A.C. Lindgren et al., Joint frequency domain and reconstructed phase space features for speech recognition
  • R.J. Povinelli et al., Time series classification using Gaussian mixture models of reconstructed phase spaces, IEEE Trans. Knowl. Data Eng. (2004)
  • M.T. Johnson et al., Time-domain isolated phoneme classification using reconstructed phase spaces, IEEE Trans. Speech Audio Process. (2005)

    Yasser Shekofteh received his BS in biomedical engineering and electrical engineering from Amirkabir University of Technology, Tehran, Iran, in 2005 and 2006, respectively. He received his MS in biomedical engineering from Amirkabir University of Technology in 2008. He is currently a PhD candidate in the Biomedical Engineering Department at Amirkabir University of Technology. His research is mainly focused on signal processing, speech recognition, and keyword spotting.

    Farshad Almasganj received his MS in electrical engineering from Amirkabir University of Technology, Tehran, Iran, in 1987 and his PhD in biomedical engineering from Tarbiat Modares University, Tehran, Iran, in 1998. He is currently an associate professor in the Biomedical Engineering Department of Amirkabir University of Technology. His research interests include automatic detection of voice disorders, speech recognition, prosody, and language modeling for ASR systems.
