A clustering based feature selection method in spectro-temporal domain for speech recognition

https://doi.org/10.1016/j.engappai.2012.04.004

Abstract

Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes the domain unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture model (GMM) and weighted K-means (WKM) clustering techniques are applied to the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as the attributes of the secondary feature vector of each frame. To evaluate the efficiency of the proposed approach, the new feature vectors were tested on the classification of phonemes within the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vectors yields a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives over MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering. The greatest improvement, about 7.4%, is obtained using WKM clustering in the classification of front vowels.

Introduction

One of the decisive issues in the performance of speech recognition systems is the acoustic representation of the speech signal. Successful examples of audio representations are Mel-scaled frequency cepstral coefficients (Davis and Mermelstein, 1980) and spectro-temporal features (Chi et al., 2005, Mesgarani et al., 2006, Mesgarani et al., 2008), both inspired by models of human hearing. In particular, spectro-temporal features use a simplified model of the human cortical stage built on successful modeling of inner-ear functionality. Although there had been modeling investigations of the inner ear (Yang et al., 1992) and the auditory cortical system (Wang and Shamma, 1995), these models were not employed in engineering applications for about ten years. Recently, a computational auditory model was derived from neurological and biological investigations of the various stages of the brain's auditory system (Chi et al., 2005) and has since been applied to phoneme classification (Mesgarani et al., 2008), voice activity detection (Mesgarani et al., 2006, Valipour et al., 2010), speaker separation (Elhilali and Shamma, 2004, Rigaud et al., 2011), auditory attention (Shamma et al., 2011) and speech enhancement (Mesgarani and Shamma, 2005). This model has two main stages. In the auditory modeling stage, an auditory spectrogram is extracted from the input acoustic signal. In the next stage, the spectro-temporal features of speech are extracted by applying a set of two-dimensional spectro-temporal receptive field (STRF) filters to the spectrogram. STRF filters are scaled versions of a two-dimensional impulse response (Chi et al., 2005). Modified versions of these features have been observed to be more robust in noisy environments than cepstral coefficients (Bouvrie et al., 2008).
The main drawback of spectro-temporal analysis is the large number of extracted features, which may degrade parameter estimation accuracy in the training phase of a speech classifier. Methods such as PCA, LDA and neural networks have been used to reduce the number of features in the spectro-temporal domain (Mesgarani et al., 2006, Meyer and Kollmeier, 2011). These are general-purpose feature reduction methods, however, and are not specifically tailored to speech classification problems. In addition, some approaches try to find the best 2D impulse response (best scale, best rate) for extracting the appropriate features (Mesgarani et al., 2008).
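For context, the general-purpose reduction route mentioned above can be sketched with a plain eigendecomposition-based PCA. This is illustrative only: the frame count and feature dimensionality below are hypothetical and not taken from the paper.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (n_frames x n_features) onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # keep the top-variance directions
    return X_centered @ top

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))   # e.g. flattened rate-scale-frequency features per frame
X_low = pca_reduce(X, 40)
print(X_low.shape)  # (200, 40)
```

Such a projection is blind to class structure, which is one reason a generic method may discard speech-relevant directions.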

This study is motivated by the clustered behavior of information in the spectro-temporal domain. In fact, phoneme information is concentrated in specific parts of the spectro-temporal feature space, so it is natural to represent a phoneme by the parameters of a number of clusters in that space. Studies have shown that the space of MFCC features is not well clustered (Kinnunen et al., 2001): it does not contain distinct clusters that represent the short-time properties of an utterance, so applying clustering methods to MFCC amounts to little more than covering the space. In contrast, there are domains in which clustering yields better secondary features for signal representation and classification (Yu and Kamarthi, 2010, Ahmed and Mohamad, 2008, Yu et al., 2007). This study examines the effect of clustering on the representation of speech in the spectro-temporal domain. It will be shown that phonemes are more separable in the new clustered feature space.

There are many clustering methods, including the Gaussian mixture model (GMM) (Duda et al., 2001), K-means and weighted K-means (WKM) clustering (Kerdprasop et al., 2005), support vector clustering (Ping et al., 2010) and scale-space statistics clustering (Sakai and Imiya, 2009). Clustering may be used either as a classification tool for audio and speech signals (Dhanalakshmi et al., 2011) or as a tool to extract and select a set of secondary acoustic features. This study focuses on the second approach. Two clustering methods are studied for reducing the spectro-temporal features to a few effective secondary features per frame. GMM and WKM clustering have been shown to be useful in many practical image segmentation applications (Blekas et al., 2005, Abras and Ballarin, 2005). GMM, in particular, is a good choice for modeling irregular data. Therefore, in this paper, a spatial GMM is employed to cluster the feature space as a feature reduction approach. Spatial GMM input vectors include position attributes in addition to the representation attributes at each point. This enlarges the vectors and may cause inaccuracy in parameter estimation. To reduce the size of the vectors in the clustering procedure, the vectors should be weighted according to their importance in the representation of the corresponding frame. Therefore, the WKM clustering algorithm is investigated as another method that may be useful for clustering the spectro-temporal space.
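A minimal weighted K-means can be sketched as follows: each point carries a weight (standing in for its importance in representing the frame) and each centroid is the weighted mean of its assigned points. The toy 2D data is hypothetical; this is an illustration of the WKM idea, not the paper's implementation.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50):
    # Farthest-point initialisation: deterministic and spreads the seeds out.
    centroids = [points[0]]
    for _ in range(k - 1):
        dist = np.min(
            np.linalg.norm(points[:, None] - np.array(centroids)[None], axis=2),
            axis=1)
        centroids.append(points[dist.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # weighted mean of the points in cluster j
                w = weights[mask]
                centroids[j] = (w[:, None] * points[mask]).sum(0) / w.sum()
    return centroids, labels

# Two well-separated toy blobs with uniform weights.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
wts = np.ones(len(pts))
cents, labels = weighted_kmeans(pts, wts, k=2)
print(sorted(np.round(cents.mean(axis=1)).astype(int).tolist()))  # [0, 5]
```

In the spectro-temporal setting, the weights would be derived from the magnitude of the cortical response at each point, so high-energy regions dominate the centroid positions.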

The organization of the paper is as follows. The spectro-temporal representation is briefly discussed in Section 2. The proposed secondary feature extraction algorithm for phonemes, using the behavior of GMM and WKM clusters in the spectro-temporal domain, is presented in Section 3. The proposed features are experimentally evaluated in the feature space and tested on a phoneme classification task in Section 4. The paper is concluded in Section 5.

Section snippets

Spectro-temporal feature representation

The auditory model described in this section is a mathematical model of the inner ear and the first layer of the auditory cortex that has been used in speech processing applications in recent years. The block diagram of the auditory model is shown in Fig. 1.

Block diagram of the auditory model

The output of each branch of the filter-bank may be modeled as

r_+(t,f;ω,Ω;θ,φ) = y(t,f) ⊛_{t,f} STRF_+(t,f;ω,Ω;θ,φ)
r_−(t,f;ω,Ω;θ,φ) = y(t,f) ⊛_{t,f} STRF_−(t,f;ω,Ω;θ,φ)

where ⊛_{t,f} denotes two-dimensional convolution over time and frequency, ω and Ω are the rate and scale parameters of the filters, and θ and φ denote the characteristic phases, which determine the degree of asymmetry along time and frequency, respectively. r_+ and r_− can be written in terms of auxiliary complex variables z_+ and z_− as (Chi et al., 2005):

r_+(t,f;ω,Ω;θ,φ) = |z_+| cos(∠z_+ − θ − φ)
r_−(t,f;ω,Ω;θ,φ) = |z_−| cos(∠z_− + θ + φ)
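To make the filtering step concrete, the sketch below convolves a toy "auditory spectrogram" with a separable Gabor-like kernel standing in for an STRF at a given rate and scale. This is a simplified stand-in, not the model's actual filters: the STRFs of Chi et al. (2005) are complex-valued seed functions dilated along time and frequency, and all sizes here are illustrative.

```python
import numpy as np

def strf_like_filter(n_t, n_f, rate, scale):
    """A separable Gabor-like 2D kernel: oscillation x Gaussian envelope on each axis."""
    t = np.arange(n_t) - n_t // 2
    f = np.arange(n_f) - n_f // 2
    h_t = np.cos(2 * np.pi * rate * t / n_t) * np.exp(-((t / (n_t / 4)) ** 2))
    h_f = np.cos(2 * np.pi * scale * f / n_f) * np.exp(-((f / (n_f / 4)) ** 2))
    return np.outer(h_t, h_f)

def conv2d_valid(y, h):
    """Naive 'valid' 2D convolution of spectrogram y(t, f) with kernel h."""
    ht, hf = h.shape
    out = np.zeros((y.shape[0] - ht + 1, y.shape[1] - hf + 1))
    hr = h[::-1, ::-1]  # flip the kernel for true convolution
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(y[i:i + ht, j:j + hf] * hr)
    return out

y = np.random.default_rng(0).random((64, 32))  # toy spectrogram: 64 frames x 32 channels
r = conv2d_valid(y, strf_like_filter(9, 9, rate=2, scale=1))
print(r.shape)  # (56, 24)
```

Repeating this for a bank of rates and scales is what produces the large three-dimensional (frequency, rate, scale) feature volume per frame that the paper then compresses.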

Overall architecture

The overall architecture of the proposed method is depicted in Fig. 4. As shown, in the first stage the auditory spectrogram of a speech frame is calculated. Then the spectro-temporal features are extracted using the auditory spectrogram and the auditory cortex model described previously. The information at the output of the cortical stage for each frame is distributed in the three-dimensional space of the frequency, rate and scale axes. Because of the large dimensions of spectro-temporal features
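The assembly of the secondary feature vector described in the abstract (the elements of each cluster's centroid and covariance, concatenated per frame) can be sketched as follows. The cluster parameters here are dummy values; in the paper they would come from the GMM or WKM fit for the frame, and the cluster count and dimensionality are illustrative.

```python
import numpy as np

def secondary_features(centroids, covariances):
    """centroids: (k, d); covariances: (k, d, d) -> one flat secondary feature vector."""
    parts = []
    for mu, cov in zip(centroids, covariances):
        # Keep only the upper-triangular covariance entries (the matrix is symmetric).
        iu = np.triu_indices(cov.shape[0])
        parts.append(np.concatenate([mu, cov[iu]]))
    return np.concatenate(parts)

k, d = 3, 3                      # e.g. 3 clusters in the (frequency, rate, scale) space
mus = np.zeros((k, d))           # dummy centroids
covs = np.stack([np.eye(d)] * k) # dummy covariances
vec = secondary_features(mus, covs)
print(vec.shape)  # k * (d + d*(d+1)/2) = 3 * (3 + 6) = 27 -> (27,)
```

The point of the construction is the size: a fixed, small number of parameters per frame replaces the full spectro-temporal volume as the classifier input.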

Experimental setup

To assess the performance of the proposed features, a set of tests was conducted on the classification of phonemes within the main phoneme categories. Most of the experiments were performed on the common /b/, /d/, /g/ classification task, one of the hardest-to-discriminate sets of phonemes and a benchmark in many studies in this field (Gas et al., 2004, Waibel et al., 1989, Yousefi Azar and Razzazi, 2010); however, after investigating the characteristics of the proposed features, the

Conclusion

In this paper, a new method was presented for spectro-temporal secondary feature selection/extraction in order to reduce the dimensionality of the feature space. The method is based on clustering in the spectro-temporal domain to extract the main energy concentration points of each acoustic event. In the proposed method, the positions of the clusters are determined using GMM and WKM at each frame of speech. Two types of feature vectors were used for phoneme classification. The

References (31)

  • Bouvrie, J., Ezzat, T., Poggio, T., 2008. Localized spectro-temporal cepstral analysis of speech. Proceedings of...
  • Chi, T., et al., 2005. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am.
  • Davis, S., et al., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
  • Duda, R.O., et al., 2001. Pattern Classification.
  • Elhilali, M., Shamma, S.A., 2004. Adaptive cortical model for auditory streaming and monaural speaker separation....