A clustering based feature selection method in spectro-temporal domain for speech recognition

https://doi.org/10.1016/j.engappai.2012.04.004

Abstract

Spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes the domain unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture model (GMM) and weighted K-means (WKM) clustering techniques are applied to the spectro-temporal domain to reduce the dimensionality of the feature space. The elements of the centroid vectors and covariance matrices of the clusters are taken as the attributes of the secondary feature vector of each frame. To evaluate the efficiency of the proposed approach, the new feature vectors were tested on the classification of phonemes within the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vectors yields a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives over MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering. The greatest improvement, about 7.4%, is obtained using WKM clustering in the classification of front vowels.

Introduction

One of the decisive issues in the performance of speech recognition systems is the acoustic representation of the speech signal. Successful examples of audio representations are Mel-scaled frequency cepstral coefficients (Davis and Mermelstein, 1980) and spectro-temporal features (Chi et al., 2005, Mesgarani et al., 2006, Mesgarani et al., 2008), both inspired by models of human hearing. In particular, spectro-temporal features use a simplified model of the human cortical stage built on successful modeling of inner-ear functionality. Although there had been modeling investigations of the inner ear (Yang et al., 1992) and the auditory cortical system (Wang and Shamma, 1995), these models were not employed in engineering applications for about ten years. Recently, a computational auditory model was derived from neurological and biological investigations of the various stages of the brain's auditory system (Chi et al., 2005) and has since been applied to phoneme classification (Mesgarani et al., 2008), voice activity detection (Mesgarani et al., 2006, Valipour et al., 2010), speaker separation (Elhilali and Shamma, 2004, Rigaud et al., 2011), auditory attention (Shamma et al., 2011) and speech enhancement (Mesgarani and Shamma, 2005). This model has two main stages. In the auditory modeling stage, an auditory spectrogram is extracted from the input acoustic signal. In the next stage, the spectro-temporal features of speech are extracted by applying a set of two-dimensional spectro-temporal receptive field (STRF) filters to the spectrogram. STRF filters are scaled versions of a two-dimensional impulse response (Chi et al., 2005). Modified versions of these features have been observed to be more robust in noisy environments than cepstral coefficients (Bouvrie et al., 2008).
The main drawback of spectro-temporal analysis is the large number of extracted features, which may degrade parameter estimation accuracy in the training phase of a speech classifier. Methods such as PCA, LDA and neural networks have been used to reduce the number of features in the spectro-temporal domain (Mesgarani et al., 2006, Meyer and Kollmeier, 2011). These are general-purpose feature reduction methods, however, and are not specifically tailored to speech classification problems. In addition, some approaches try to find the best 2D impulse response (best scale, best rate) for extracting the appropriate features (Mesgarani et al., 2008).
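For context, the general-purpose reduction route mentioned above can be sketched with a plain eigendecomposition-based PCA. This is illustrative only: the frame count and feature dimensionality below are hypothetical and not taken from the paper.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (n_frames x n_features) onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # keep the top-variance directions
    return X_centered @ top

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))   # e.g. flattened rate-scale-frequency features per frame
X_low = pca_reduce(X, 40)
print(X_low.shape)  # (200, 40)
```

Such a projection is blind to class structure, which is one reason a generic method may discard speech-relevant directions.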

This study is motivated by the clustered behavior of information in the spectro-temporal domain. In fact, phoneme information is concentrated in specific parts of the spectro-temporal feature space, so it is natural to represent a phoneme by the parameters of a number of clusters in that space. Studies have shown that the space of MFCC features is not well clustered (Kinnunen et al., 2001): it does not contain distinct clusters that represent the short-time properties of an utterance, so applying clustering methods to MFCC amounts to little more than covering the space. In contrast, there are domains in which clustering yields better secondary features for signal representation and classification (Yu and Kamarthi, 2010, Ahmed and Mohamad, 2008, Yu et al., 2007). This study examines the effect of clustering on the representation of speech in the spectro-temporal domain. It will be shown that phonemes are more separable in the new clustered feature space.

There are many clustering methods, including the Gaussian mixture model (GMM) (Duda et al., 2001), K-means and weighted K-means (WKM) clustering (Kerdprasop et al., 2005), support vector clustering (Ping et al., 2010) and scale-space statistics clustering (Sakai and Imiya, 2009). Clustering may be used either as a classification tool for audio and speech signals (Dhanalakshmi et al., 2011) or as a tool to extract and select a set of secondary acoustic features. This study focuses on the second approach. Two clustering methods are studied for reducing the spectro-temporal features to a few effective secondary features per frame. GMM and WKM clustering have been shown to be useful in many practical image segmentation applications (Blekas et al., 2005, Abras and Ballarin, 2005). GMM, in particular, is a good choice for modeling irregular data. Therefore, in this paper, a spatial GMM is employed to cluster the feature space as a feature reduction approach. Spatial GMM input vectors include position attributes in addition to the representation attributes at each point. This enlarges the vectors and may cause inaccuracy in parameter estimation. To reduce the size of the vectors in the clustering procedure, the vectors should be weighted according to their importance in the representation of the corresponding frame. Therefore, the WKM clustering algorithm is investigated as another method that may be useful for clustering the spectro-temporal space.
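A minimal weighted K-means can be sketched as follows: each point carries a weight (standing in for its importance in representing the frame) and each centroid is the weighted mean of its assigned points. The toy 2D data is hypothetical; this is an illustration of the WKM idea, not the paper's implementation.

```python
import numpy as np

def weighted_kmeans(points, weights, k, n_iter=50):
    # Farthest-point initialisation: deterministic and spreads the seeds out.
    centroids = [points[0]]
    for _ in range(k - 1):
        dist = np.min(
            np.linalg.norm(points[:, None] - np.array(centroids)[None], axis=2),
            axis=1)
        centroids.append(points[dist.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # weighted mean of the points in cluster j
                w = weights[mask]
                centroids[j] = (w[:, None] * points[mask]).sum(0) / w.sum()
    return centroids, labels

# Two well-separated toy blobs with uniform weights.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
wts = np.ones(len(pts))
cents, labels = weighted_kmeans(pts, wts, k=2)
print(sorted(np.round(cents.mean(axis=1)).astype(int).tolist()))  # [0, 5]
```

In the spectro-temporal setting, the weights would be derived from the magnitude of the cortical response at each point, so high-energy regions dominate the centroid positions.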

The organization of the paper is as follows. The spectro-temporal representation is briefly discussed in Section 2. The proposed secondary feature extraction algorithm for phonemes, using the behavior of GMM and WKM clusters in the spectro-temporal domain, is presented in Section 3. The proposed features are experimentally evaluated in the feature space and tested on a phoneme classification task in Section 4. The paper is concluded in Section 5.

Section snippets

Spectro-temporal feature representation

The auditory model described in this section is a mathematical model of the inner ear and the first layer of the auditory cortex that has been used in speech processing applications in recent years. The block diagram of the auditory model is shown in Fig. 1.

Block diagram of the auditory model

The output of each branch of the filter-bank may be modeled as

r_+(t,f;ω,Ω;θ,φ) = y(t,f) ⊛_{t,f} STRF_+(t,f;ω,Ω;θ,φ)
r_−(t,f;ω,Ω;θ,φ) = y(t,f) ⊛_{t,f} STRF_−(t,f;ω,Ω;θ,φ)

where ⊛_{t,f} denotes two-dimensional convolution over time and frequency, ω and Ω are the rate and scale parameters of the filters, and θ and φ denote the characteristic phases, which determine the degree of asymmetry along time and frequency, respectively. r_+ and r_− can be written in terms of auxiliary complex variables z_+ and z_− as (Chi et al., 2005):

r_+(t,f;ω,Ω;θ,φ) = |z_+| cos(∠z_+ − θ − φ)
r_−(t,f;ω,Ω;θ,φ) = |z_−| cos(∠z_− + θ + φ)
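To make the filtering step concrete, the sketch below convolves a toy "auditory spectrogram" with a separable Gabor-like kernel standing in for an STRF at a given rate and scale. This is a simplified stand-in, not the model's actual filters: the STRFs of Chi et al. (2005) are complex-valued seed functions dilated along time and frequency, and all sizes here are illustrative.

```python
import numpy as np

def strf_like_filter(n_t, n_f, rate, scale):
    """A separable Gabor-like 2D kernel: oscillation x Gaussian envelope on each axis."""
    t = np.arange(n_t) - n_t // 2
    f = np.arange(n_f) - n_f // 2
    h_t = np.cos(2 * np.pi * rate * t / n_t) * np.exp(-((t / (n_t / 4)) ** 2))
    h_f = np.cos(2 * np.pi * scale * f / n_f) * np.exp(-((f / (n_f / 4)) ** 2))
    return np.outer(h_t, h_f)

def conv2d_valid(y, h):
    """Naive 'valid' 2D convolution of spectrogram y(t, f) with kernel h."""
    ht, hf = h.shape
    out = np.zeros((y.shape[0] - ht + 1, y.shape[1] - hf + 1))
    hr = h[::-1, ::-1]  # flip the kernel for true convolution
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(y[i:i + ht, j:j + hf] * hr)
    return out

y = np.random.default_rng(0).random((64, 32))  # toy spectrogram: 64 frames x 32 channels
r = conv2d_valid(y, strf_like_filter(9, 9, rate=2, scale=1))
print(r.shape)  # (56, 24)
```

Repeating this for a bank of rates and scales is what produces the large three-dimensional (frequency, rate, scale) feature volume per frame that the paper then compresses.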

Overall architecture

The overall architecture of the proposed method is depicted in Fig. 4. As shown, in the first stage the auditory spectrogram of a speech frame is calculated. Then the spectro-temporal features are extracted using the auditory spectrogram and the auditory cortex model described previously. The information at the output of the cortical stage for each frame is distributed in the three-dimensional space of the frequency, rate and scale axes. Because of the large dimensions of spectro-temporal features
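The assembly of the secondary feature vector described in the abstract (the elements of each cluster's centroid and covariance, concatenated per frame) can be sketched as follows. The cluster parameters here are dummy values; in the paper they would come from the GMM or WKM fit for the frame, and the cluster count and dimensionality are illustrative.

```python
import numpy as np

def secondary_features(centroids, covariances):
    """centroids: (k, d); covariances: (k, d, d) -> one flat secondary feature vector."""
    parts = []
    for mu, cov in zip(centroids, covariances):
        # Keep only the upper-triangular covariance entries (the matrix is symmetric).
        iu = np.triu_indices(cov.shape[0])
        parts.append(np.concatenate([mu, cov[iu]]))
    return np.concatenate(parts)

k, d = 3, 3                      # e.g. 3 clusters in the (frequency, rate, scale) space
mus = np.zeros((k, d))           # dummy centroids
covs = np.stack([np.eye(d)] * k) # dummy covariances
vec = secondary_features(mus, covs)
print(vec.shape)  # k * (d + d*(d+1)/2) = 3 * (3 + 6) = 27 -> (27,)
```

The point of the construction is the size: a fixed, small number of parameters per frame replaces the full spectro-temporal volume as the classifier input.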

Experimental setup

To assess the performance of the proposed features, a set of tests was conducted on the classification of phonemes within the main phoneme categories. Most of the experiments were performed on the common /b/, /d/, /g/ classification task, one of the hardest-to-discriminate sets of phonemes and a benchmark in many studies in this field (Gas et al., 2004, Waibel et al., 1989, Yousefi Azar and Razzazi, 2010); however, after investigating the characteristics of the proposed features, the

Conclusion

In this paper, a new method was presented for spectro-temporal secondary feature selection/extraction in order to reduce the dimensionality of the feature space. The method is based on clustering in the spectro-temporal domain to extract the main energy concentration points of each acoustic event. In the proposed method, the positions of the clusters are determined using GMM and WKM at each frame of speech. Two types of feature vectors were used for phoneme classification. The

References (31)

  • Bouvrie, J., Ezzat, T., Poggio, T., 2008. Localized spectro-temporal cepstral analysis of speech. Proceedings of...
  • Chi, T., et al., 2005. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am.
  • Davis, S., et al., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
  • Duda, R.O., et al., 2001. Pattern Classification.
  • Elhilali, M., Shamma, S.A., 2004. Adaptive cortical model for auditory streaming and monaural speaker separation....