Engineering Applications of Artificial Intelligence
A clustering based feature selection method in spectro-temporal domain for speech recognition
Introduction
One of the decisive factors in the performance of speech recognition systems is the acoustic representation of the speech signal. Successful examples of audio representations are Mel-scaled frequency cepstral coefficients (Davis and Mermelstein, 1980) and spectro-temporal features (Chi et al., 2005, Mesgarani et al., 2006, Mesgarani et al., 2008), both inspired by models of human hearing. In particular, spectro-temporal features use a simplified model of the cortical stage of the human brain, built on earlier successful models of inner-ear function. Although the inner ear (Yang et al., 1992) and the auditory cortical system (Wang and Shamma, 1995) had been modeled with engineering applications in mind, these models went unused in practice for about a decade. More recently, a computational auditory model grounded in neurological and biological studies of the various stages of the brain's auditory system was proposed (Chi et al., 2005) and has since been applied to phoneme classification (Mesgarani et al., 2008), voice activity detection (Mesgarani et al., 2006, Valipour et al., 2010), speaker separation (Elhilali and Shamma, 2004, Rigaud et al., 2011), auditory attention (Shamma et al., 2011) and speech enhancement (Mesgarani and Shamma, 2005). The model has two main stages. In the auditory modeling stage, an auditory spectrogram is extracted from the input acoustic signal. In the next stage, the spectro-temporal features of speech are extracted by applying a set of two-dimensional spectro-temporal receptive field (STRF) filters to the spectrogram. STRF filters are scaled versions of a two-dimensional impulse response (Chi et al., 2005). Modified versions of these features have been observed to be more robust in noisy environments than cepstral coefficients (Bouvrie et al., 2008).
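The two-stage extraction above can be sketched as follows. This is a minimal illustration, not the model of Chi et al. (2005): the separable cosine-times-decay kernels, the filter-bank sizes and the random placeholder spectrogram are all assumptions made for brevity.

```python
import numpy as np
from scipy.signal import fftconvolve

def strf_kernel(rate, scale, n_t=32, n_f=32):
    """Toy separable spectro-temporal kernel: a temporal modulation at
    `rate` times a spectral modulation at `scale`. Illustrative stand-in
    for the scaled 2D impulse responses (STRFs) described in the text."""
    t = np.linspace(0, 1, n_t)
    f = np.linspace(0, 1, n_f)
    h_t = np.cos(2 * np.pi * rate * t) * np.exp(-3 * t)   # temporal impulse response
    h_f = np.cos(2 * np.pi * scale * f) * np.exp(-3 * f)  # spectral impulse response
    return np.outer(h_t, h_f)

def cortical_features(spectrogram, rates=(2, 4, 8), scales=(0.5, 1, 2)):
    """Stage 2: filter an auditory spectrogram with a bank of STRFs,
    yielding one response map per (rate, scale) pair."""
    return {(r, s): fftconvolve(spectrogram, strf_kernel(r, s), mode="same")
            for r in rates for s in scales}

spec = np.abs(np.random.randn(64, 64))    # placeholder for a real auditory spectrogram
feats = cortical_features(spec)
print(len(feats), feats[(2, 0.5)].shape)  # 9 response maps, each 64x64
```

Note how the filter bank multiplies the data volume: nine response maps per frame already, before the rate and scale axes are sampled more densely.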
The main drawback of spectro-temporal analysis is the large number of extracted features, which may degrade parameter estimation accuracy when training a speech classifier. Methods such as PCA, LDA and neural networks have been used to reduce the number of features in the spectro-temporal domain (Mesgarani et al., 2006, Meyer and Kollmeier, 2011). However, these are general-purpose feature reduction methods and are not tailored to speech classification problems. In addition, some approaches try to find the best 2D impulse response (best scale, best rate) for extracting the appropriate features (Mesgarani et al., 2008).
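The PCA baseline mentioned above amounts to projecting each frame's flattened spectro-temporal vector onto a few principal directions. The sizes here (500 frames, 1024 raw values, 40 components) are hypothetical choices for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 frames, each a flattened spectro-temporal response of 1024 values (toy numbers).
X = rng.normal(size=(500, 1024))
pca = PCA(n_components=40).fit(X)   # learn 40 principal directions on training frames
X_red = pca.transform(X)            # per-frame feature vector shrinks 1024 -> 40
print(X_red.shape)                  # (500, 40)
```

Such a projection is blind to where energy concentrates in the rate–scale plane, which is exactly the structure the clustering approach of this paper tries to exploit.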
This study is motivated by the clustered behavior of information in the spectro-temporal domain: a phoneme's information is concentrated in specific parts of the spectro-temporal feature space. In other words, it is desirable to represent a phoneme by the parameters of a number of clusters in the spectro-temporal domain. Some studies have shown that the space of MFCC features is not properly clustered (Kinnunen et al., 2001); that is, it lacks distinct clusters representing the short-time properties of an utterance, so applying clustering methods to MFCCs amounts to little more than covering the space. In contrast, in some domains clustering yields better secondary features for signal representation and classification (Yu and Kamarthi, 2010, Ahmed and Mohamad, 2008, Yu et al., 2007). This study examines the effect of clustering on speech representation in the spectro-temporal domain. It will be shown that phonemes are more separable in the new clustered feature space.
There are many clustering methods, including the Gaussian mixture model (GMM) (Duda et al., 2001), K-means and weighted K-means (WKM) clustering (Kerdprasop et al., 2005), support vector clustering (Ping et al., 2010) and scale space statistics clustering (Sakai and Imiya, 2009). Clustering may be used either as a classification tool for audio and speech signals (Dhanalakshmi et al., 2011) or as a tool to extract and select a set of secondary acoustic features. This study focuses on the second approach. Two clustering methods are studied to reduce the spectro-temporal features to a few effective secondary features per frame. GMM and WKM clustering algorithms have proved useful in many practical image segmentation applications (Blekas et al., 2005, Abras and Ballarin, 2005). In particular, GMM can model irregular data well, so spatial GMM is employed here to cluster the feature space as a feature reduction approach. Spatial GMM input vectors include position attributes in addition to the representation attributes at each point. This enlarges the vectors and may reduce the accuracy of parameter estimation. To shrink the vectors used in the clustering procedure, the points can instead be weighted according to their importance in representing the corresponding frame. Therefore, WKM clustering is investigated as another method that may be suitable for clustering the spectro-temporal space.
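The contrast between the two options can be sketched on a single frame. The 32x32 energy map, the choice of four clusters and the use of scikit-learn's `sample_weight` to realize weighted K-means are all illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
energy = rng.random((32, 32))   # hypothetical rate-scale energy map of one frame
rr, ss = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")

# Spatial GMM: each point carries its position plus the representation value there.
X = np.column_stack([rr.ravel(), ss.ravel(), energy.ravel()])
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

# WKM: cluster the (shorter) position vectors only, weighting each by its energy.
pos = X[:, :2]
wkm = KMeans(n_clusters=4, n_init=10, random_state=0).fit(
    pos, sample_weight=energy.ravel())

# Either set of cluster parameters can serve as the frame's secondary features.
secondary = np.concatenate([gmm.means_.ravel(), wkm.cluster_centers_.ravel()])
print(secondary.shape)   # (20,) = 4 GMM means of length 3 + 4 WKM centers of length 2
```

The sketch shows the trade-off in the paragraph above: spatial GMM estimates parameters over longer vectors, while WKM keeps the vectors short and pushes the representation value into the weights.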
The organization of the paper is as follows. The spectro-temporal representation is briefly discussed in Section 2. The proposed secondary feature extraction algorithm, which uses the behavior of GMM and WKM clusters in the spectro-temporal domain, is presented in Section 3. The proposed features are evaluated experimentally in the feature space and tested on a phoneme classification task in Section 4. The paper is concluded in Section 5.
Spectro-temporal feature representation
The auditory model described in this section is a mathematical model of the inner ear and the first layer of the auditory cortex, which has been used in speech processing applications in recent years. The block diagram of the auditory model is shown in Fig. 1.
Fig. 1. Block diagram of the auditory model.
The output of each branch of the filter-bank may be modeled as

r(t, f; ω, Ω; θ, φ) = y(t, f) ⊛t,f STRF(t, f; ω, Ω; θ, φ),

where y(t, f) is the auditory spectrogram, ⊛t,f denotes two-dimensional convolution along time and frequency, and ω and Ω are the rate and scale parameters of the filters. In addition, θ and φ denote the characteristic phases of the filters, which determine the degree of asymmetry along time and frequency, respectively. r+ and r− can be defined using auxiliary variables z+ and z− as below (Chi et al., 2005): …
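The role of the characteristic phases can be illustrated with a one-dimensional sketch: combining an impulse response with its Hilbert transform, h·cos(θ) + H{h}·sin(θ), sweeps the response from symmetric (θ = 0) to fully asymmetric (θ = π/2). The gammatone-like seed response below is a hypothetical example, not the paper's actual filter:

```python
import numpy as np
from scipy.signal import hilbert

def phased_ir(h, theta):
    """Realize a characteristic phase theta by mixing an impulse response
    with its Hilbert transform (quadrature component)."""
    analytic = hilbert(h)                       # h + j*H{h}
    return h * np.cos(theta) + analytic.imag * np.sin(theta)

t = np.linspace(0, 1, 256)
h = np.sin(2 * np.pi * 4 * t) * t**2 * np.exp(-3.5 * t)  # toy gammatone-like seed IR
h0 = phased_ir(h, 0.0)          # theta = 0: the symmetric seed response itself
h90 = phased_ir(h, np.pi / 2)   # theta = pi/2: the fully asymmetric (quadrature) version
print(np.allclose(h0, h))       # True
```

Applying the same construction along frequency with phase φ yields the family of asymmetric spectral filters.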
Overall architecture
The overall architecture of the proposed method is depicted in Fig. 4. As shown, in the first stage the auditory spectrogram of a speech frame is calculated. Then, the spectro-temporal features are extracted using the auditory spectrogram and the auditory cortex model described above. The information at the output of the cortical stage for each frame is distributed in the three-dimensional space of the frequency, rate and scale axes. Because of the large dimensions of the spectro-temporal features …
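The per-frame reduction step can be sketched end to end: cluster the high-energy points of the frame's (frequency, rate, scale) feature cube and keep the cluster parameters as a short secondary vector. The cube size, the 90th-percentile energy threshold and the three-cluster GMM are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_secondary_features(cube, n_clusters=3):
    """Reduce one frame's (freq, rate, scale) feature cube to a short
    secondary vector built from GMM cluster centers."""
    F, R, S = cube.shape
    ff, rr, ss = np.meshgrid(*(np.arange(d) for d in (F, R, S)), indexing="ij")
    pts = np.column_stack([ff.ravel(), rr.ravel(), ss.ravel(), cube.ravel()])
    pts = pts[pts[:, 3] > np.percentile(pts[:, 3], 90)]  # keep only energy peaks
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(pts)
    return gmm.means_.ravel()   # n_clusters centers, 4 coordinates each

cube = np.abs(np.random.default_rng(2).normal(size=(16, 8, 8)))  # toy cortical output
vec = frame_secondary_features(cube)
print(vec.shape)   # (12,): far shorter than the 16*8*8 = 1024 raw values
```

The 1024-dimensional frame is thus summarized by 12 numbers describing where its energy concentrates.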
Experimental setup
To assess the performance of the proposed features, a set of tests was conducted on the classification of phonemes within the main phoneme categories. Most experiments are performed on the common /b/, /d/, /g/ classification task, one of the hardest-to-discriminate sets of phonemes and a benchmark in many studies in this field (Gas et al., 2004, Waibel et al., 1989, Yousefi Azar and Razzazi, 2010). However, after investigating the characteristics of the proposed features, the …
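A three-class evaluation of this kind typically cross-validates a classifier on the secondary feature vectors. The synthetic Gaussian data, the 12-dimensional vectors and the RBF-kernel SVM below are placeholder choices to show the protocol only; they do not reproduce the paper's classifiers or results:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Hypothetical secondary feature vectors for three phoneme classes /b/, /d/, /g/:
# 60 frames per class, 12 features per frame, class means shifted apart.
X = np.vstack([rng.normal(loc=c, size=(60, 12)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 60)

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)  # 5-fold accuracy
print(round(scores.mean(), 2))
```

Replacing the synthetic matrix with real per-frame cluster parameters gives the evaluation pipeline described in this section.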
Conclusion
In this paper, a new method was presented for spectro-temporal secondary feature selection/extraction in order to reduce the dimensions of the feature space. The method is based on clustering in the spectro-temporal domain to extract the main energy concentration points of each acoustic event. In the proposed method, the positions of the clusters are determined using GMM and WKM for each frame of speech. Two types of feature vectors were used for phoneme classification. The …
References
- Dhanalakshmi et al., 2011. Pattern classification models for classifying and indexing audio signals. Eng. Appl. Artif. Intell.
- Gas et al., 2004. Discriminant neural predictive coding applied to phoneme recognition. Neurocomputing.
- Meyer and Kollmeier, 2011. Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition. Speech Commun.
- Ping et al., 2010. Improved support vector clustering. Eng. Appl. Artif. Intell.
- Sakai and Imiya, 2009. Unsupervised cluster discovery using statistics in scale space. Eng. Appl. Artif. Intell.
- Shamma et al., 2011. Temporal coherence and attention in auditory scene analysis. Trends Neurosci.
- Yu and Kamarthi, 2010. A cluster-based wavelet feature extraction method and its application. Eng. Appl. Artif. Intell.
- Kerdprasop et al., 2005. A weighted K-means algorithm applied to brain tissue classification. J. Comput. Sci. Technol. (JCS&T).
- Ahmed and Mohamad, 2008. Segmentation of brain MR images for tumor extraction by combining K-means clustering and Perona–Malik anisotropic diffusion model. Int. J. Image Process.
- Blekas et al., 2005. A spatially constrained mixture model for image segmentation. IEEE Trans. Neural Networks.
- Chi et al., 2005. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am.
- Davis and Mermelstein, 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.
- Duda et al., 2001. Pattern Classification.