Reduced feature-set based parallel CHMM speech recognition systems
Introduction
Speaker-independent speech recognition systems have many parameters to optimise during implementation. They must cope with vast uncertainties arising from the varied production behaviour of different speakers. Statistical approaches based on hidden Markov models (HMMs) have proved superior to other techniques at capturing and modelling the features that carry the spoken information. The HMM framework represents the speech signal as a variable-duration sequence of events called states [25]. The ability of the HMM to discriminate between acoustic classes is highly affected by the observation feature vectors, which can be regarded as abstract mappings of the highly redundant speech samples. The feature vectors need to be as short as possible in dimension, which implies redundancy removal, while retaining as much linguistic information as possible. The selected features must ensure fast training and recognition procedures as well as good acoustic class discrimination. Feature vectors have been widely investigated, and many designs have been proposed in pursuit of the goal of good abstraction and representation. Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors to represent the speech samples. Other types of features, different from the MFCCs, have also been introduced and show strengths in certain applications. No feature set can be declared the absolute best performer under all environmental conditions in automatic speech recognition (ASR) systems. One way to exploit the strengths of different feature sets is to combine them deliberately under a suitable paradigm. This combination can be done at several points within the ASR structure.
The features can be concatenated at the very beginning, in the feature-stream domain, and presented to a single general classifier, or left as independent streams and presented to separate classifiers. Likewise, the classifier outcomes can be merged and presented to a general HMM decoder, or kept separate and presented to independent HMM decoders. The two main questions that need to be answered in any multi-stream ASR system design are: which feature sets to stream, and where to merge? There is no agreed analytical procedure for answering these questions; the solutions are mainly heuristic. However, there are some trends in using statistical notions to guide the decisions. The conditional mutual information (CMI) has been used to predict which feature streams will merge most advantageously and which of the many possible merging strategies will be most successful; it addresses the first question. The CMI of the raw feature streams is supposed to help in deciding whether to merge them into one large stream, or to feed them separately into independent classifiers for later merging [7], [8]. However, the reported results of the CMI technique are not very encouraging.
An important property of feature streams is that combining a number of diverse streams often improves recognition performance, and the greatest benefits come from combinations of the most diverse features [30]. Different front-end structures can be used to maximise stream diversity. Combining perceptual linear predictive (PLP) features with modulation-filtered spectrogram (MSG) features improves the recognition rate significantly [9]. Indeed, even a small change in the feature-vector preparation procedure can improve the recognition rate. Billa et al. [4] found that combining nearly identical feature sets differing only in frame rate, set between 80 and 125 frames per second, was enough to introduce some decorrelation between the errors in the streams; the merged system performed significantly better than any of the component streams. A variable frame rate is also useful in single-stream ASR systems: the frame rate is increased for rapidly changing segments with relatively high energy, and reduced for steady-state segments [31].
The classifiers are either Gaussian mixture models (GMMs) or neural networks (NNs). Hybridising HMMs with NNs is widely used in single-stream continuous speech recognition systems [22], while HMM speech recognition systems typically use GMMs. NNs are becoming popular in the multi-stream paradigm owing to their potential for probability estimation and classification. Merging the streams after the classification stage (posterior merging), rather than concatenating features, improves the recognition rate one step further [7], [8], [26]. The classifier outcomes may also be merged and decorrelated first, then presented to the GMM of a classical HMM decoder for better recognition performance [9]. The multi-estimation notion is also applicable to NN-based systems: it has been shown that recognition performance can be improved by using the same feature sets to train two NNs with different initialisation points [19].
Another interesting multi-stream approach comes from the sub-band notion. Rather than deriving the probability streams from completely different acoustic representations, it is also possible to divide a single representation into disjoint regions across the spectrum. Each sub-band can then be used as the basis for a separate probability estimator. The outputs of these estimators can be combined, either by averaging the log posterior probabilities for each class, or by using more complex methods including multi-layer perceptrons or weighted combinations [5], [6]. More specifically, in the sub-band technique the whole frequency band of the speech signal is split into several sub-bands, and each sub-band is processed independently, usually by a hybrid HMM neural network model. The technique rests on an assumption of sub-band independence, which is not strictly valid, since in reality the sub-bands are dependent. The sub-band outcomes are then recombined at several stages during the utterance according to certain criteria. The main advantage of this approach is the robustness of the recogniser to selective narrow-band noise [21]. The technique has also been adopted in random field modelling to model the hidden states of the HMM [12].
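The recombination step above, averaging (possibly weighted) log posterior probabilities across sub-bands, can be sketched as follows. This is an illustrative Python sketch, not the cited systems' implementation; the function name and the two-band example posteriors are hypothetical:

```python
import numpy as np

def combine_subband_posteriors(posteriors, weights=None):
    """Combine per-sub-band class posteriors by (weighted) averaging
    of their log posteriors, then renormalising to a distribution.

    posteriors: list of arrays, each of shape (n_classes,), one per
                sub-band estimator (each summing to 1).
    weights:    optional per-sub-band weights; equal by default.
    """
    P = np.stack(posteriors)                    # (n_bands, n_classes)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    log_comb = weights @ np.log(P + 1e-12)      # weighted mean of log posteriors
    comb = np.exp(log_comb - log_comb.max())    # back to the linear domain
    return comb / comb.sum()                    # renormalise

# hypothetical outputs of two sub-band estimators for three classes
p = combine_subband_posteriors([np.array([0.7, 0.2, 0.1]),
                                np.array([0.6, 0.3, 0.1])])
```

Averaging in the log domain amounts to a geometric mean of the posteriors, which penalises classes that any single sub-band estimator considers unlikely.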
Multi-streaming has also been viewed from another perspective: the feature vectors are split into a specified number of sub-vectors, each processed by a different quantizer, and a vector of discrete values with the same length as the number of sub-vectors is presented to a discrete recogniser [28]. This system was tested with 9, 15, 24, and 39 sub-vectors, and it showed improved recognition rates compared with the conventional CHMM.
The multi-stream approach has also been investigated from the perspective of recognition in noisy environments, where it showed substantial improvement in performance under different noise sources [27].
In this paper, we treat the Mel coefficients and their first and second derivatives as three independent streams. These streams have some degree of dependency, as is obvious from the way they are produced; nevertheless, treating them as independent enhanced the recognition rate. The adopted feature vectors thus comprise 39 coefficients per observation (12 Mel coefficients and one power coefficient, with their first and second derivatives), equally divided between three streams. We then reduce the dimensionality of each stream, using the F-ratio technique as a figure of merit. This reduction leaves only 28 MFCCs per observation vector to be used in our ASR system, instead of the original 39 coefficients.
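The three-stream split described above can be sketched as follows; the helper name is hypothetical, and the random array merely stands in for real MFCC-derived features:

```python
import numpy as np

def split_streams(features):
    """Split 39-dimensional observation vectors (13 static coefficients,
    13 deltas, 13 delta-deltas) into three 13-dimensional streams.

    features: array of shape (n_frames, 39)
    returns:  three arrays, each of shape (n_frames, 13)
    """
    assert features.shape[1] == 39
    return features[:, :13], features[:, 13:26], features[:, 26:]

X = np.random.randn(100, 39)   # stand-in for 100 frames of real features
static, delta1, delta2 = split_streams(X)
```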
This paper is organised as follows. Section 2 briefly reviews some related feature-vector designs. Section 3 describes the F-ratio as a figure of merit for assessing the importance of features, and how it can be applied directly to the HMM parameters. Section 4 explains the parallel HMM notion and the application of dimensionality reduction. Section 5 evaluates the performance of ASR systems based on different paradigms. The conclusions are summarised in Section 6.
Feature vector design based on static and dynamic coefficients
Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors, representing each 10–50 ms window of speech samples, taken every 5–25 ms, by a single vector of a certain dimension. The window length and rate, as well as the feature-vector dimension, are decided according to the application task. For many applications the most effective components of the Mel-scale features are the first 12 coefficients (excluding the zero coefficient), which are also called
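The framing step described above can be sketched as follows, with illustrative values (a 25 ms Hamming window taken every 10 ms at a 16 kHz sampling rate) chosen from the quoted ranges; the function name and defaults are hypothetical:

```python
import numpy as np

def frame_signal(signal, fs=16000, win_ms=25, hop_ms=10):
    """Slice a speech signal into overlapping analysis windows.
    Each row of the result is one tapered window ready for the
    subsequent spectral analysis stage.
    """
    win = int(fs * win_ms / 1000)    # samples per window (400 here)
    hop = int(fs * hop_ms / 1000)    # samples between windows (160 here)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win]
                       for i in range(n_frames)])
    return frames * np.hamming(win)  # taper to reduce spectral leakage

frames = frame_signal(np.zeros(16000))   # one second of (silent) audio
```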
Dimensionality reduction based on the F-ratio figures
The F-ratio is a measure that can be used to evaluate the effectiveness of a particular feature. It has been widely used as a figure of merit for feature selection in speaker recognition applications [24], [29]. It is defined as the ratio of the between-class variance (B) to the within-class variance (W). In the context of feature selection for pattern classification, the F-ratio can be considered a strong catalyst for selecting the features that maximise the separation between different
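A minimal sketch of the F-ratio as a per-feature figure of merit, computed here from labelled sample sets rather than directly from the HMM parameters as in the approach above; the function name and the synthetic two-class data are hypothetical:

```python
import numpy as np

def f_ratio(classes):
    """Per-feature F-ratio: variance of the class means (B) over the
    mean within-class variance (W). Higher values mark features that
    separate the classes better.

    classes: list of arrays, each of shape (n_samples_k, n_features)
    """
    means = np.stack([c.mean(axis=0) for c in classes])    # class means
    B = means.var(axis=0)                                  # between-class variance
    W = np.mean([c.var(axis=0) for c in classes], axis=0)  # within-class variance
    return B / W

# two synthetic classes: feature 0 separates them, feature 1 does not
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 2)) + np.array([5.0, 0.0])
b = rng.standard_normal((500, 2)) + np.array([-5.0, 0.0])
r = f_ratio([a, b])   # r[0] should dwarf r[1]
```

Ranking features by `r` and keeping the top-scoring ones is the dimensionality-reduction step that leaves 28 of the original 39 coefficients.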
Parallel HMM multi-streams-based system
We developed this system based on the multi-stream notion, targeting the advantages of alleviating the dominance problem, dimensionality reduction, and design flexibility [3]. The selected speech-signal features are the power and Mel-scale coefficients with their first and second derivatives (deltas and delta-deltas). This selection is due to the high potential of these coefficients for carrying the static and temporal information of the spoken signals. The first derivative can be approximated by
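A common approximation of the first derivative is the regression formula d_t = Σ_n n(c_{t+n} − c_{t−n}) / (2 Σ_n n²) over a window of ±N frames; the sketch below is one plausible realisation of that formula, not necessarily the exact variant used here:

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients by the standard regression formula
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2), with the
    first and last frames repeated at the edges. Applying it twice
    yields the second derivatives (delta-deltas).

    features: (n_frames, n_coeffs) array of static coefficients.
    """
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom

ramp = np.arange(10.0).reshape(-1, 1)   # static coefficients with slope 1
d = delta(ramp)                          # interior deltas should equal 1
```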
Comparative studies of different ASR system paradigms
In this section, we report the recognition rates of several successful ASR systems. The feature vectors used in all the systems consist of 28 MFCCs, in the proportions specified in the previous section. For simplicity, the paradigms are denoted as follows:
- ASR-1:
is a single-stream multi-mixture CHMM with nine states and five mixtures.
- ASR-2:
is a multi-stream ASR with equal stream-merging weights. Three streams are used, each modelled by a five-mixture CHMM with nine states.
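In a multi-stream CHMM, the per-stream state emission log-likelihoods are typically merged with exponent weights, log b_j(o) = Σ_s γ_s log b_js(o_s); the equal weighting in ASR-2 corresponds to γ_s = 1 for all three streams. The helper below is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def merge_stream_log_likelihoods(stream_loglik, weights=None):
    """Merge per-stream state emission log-likelihoods with exponent
    weights: log b_j(o) = sum_s gamma_s * log b_js(o_s).

    stream_loglik: (n_streams, n_states) array of log b_js(o_s).
    weights:       per-stream weights gamma_s; all ones by default
                   (equal stream-merging weights).
    """
    if weights is None:
        weights = np.ones(stream_loglik.shape[0])
    return weights @ stream_loglik   # (n_states,) merged log-likelihoods

# two hypothetical streams scoring two states
ll = merge_stream_log_likelihoods(np.log(np.array([[0.5, 0.1],
                                                   [0.4, 0.2]])))
```

With equal unit weights the merged likelihood is simply the product of the per-stream likelihoods, i.e. the streams are treated as independent, mirroring the independence assumption discussed above.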
Conclusions
In this paper, we have investigated the problem of improving speech recognition performance by restructuring the way the feature vectors are used. Instead of treating the composite static and dynamic MFCC-based speech-signal features as a single stream, we proposed splitting them into three independent streams. Despite the fact that the streams' independence assumption is not very precise, given the way their vectors are derived, the multi-streams paradigm has outperformed
Acknowledgements
The authors would like to thank Professor Garry Tee for his constructive comments on this paper. This research is funded by The University of Auckland Research Fund under project number UARF 3602239/9273 and FRST of New Zealand, grant NERF AUT02/001.
References (31)
- Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing (1992)
- W.H. Abdulla, N.K. Kasabov, Speech recognition enhancement via robust CHMM speech background discrimination, in: Proc. ...
- W.H. Abdulla, N.K. Kasabov, Two pass hidden Markov model for speech recognition systems, in: Proceedings of ICICS'99, ...
- W.H. Abdulla, N.K. Kasabov, Feature selection for parallel CHMM speech recognition systems, in: Proceedings of the ...
- J. Billa, T. Colhurst, et al., Recent experiments in large vocabulary conversational speech recognition, in: ...
- et al., New approaches towards robust and adaptive speech recognition
- et al., Multi-stream Speech Recognition (1996)
- D.P. Ellis, Improved recognition by combining different features and different systems, in: Proceedings of AVIOS-2000, ...
- D.P. Ellis, Using mutual information to design feature combinations, in: Proceedings of the ICSLP-2000, Beijing, ...
- D.P. Ellis, R. Singh, et al., Tandem acoustic modelling in large-vocabulary recognition, in: Proceedings of the IEEE ...
- Speaker independent isolated word recognition using dynamic features of speech recognition, IEEE Trans. ASSP
- Markov random field modelling for speech recognition, Aust. J. Intell. Inform. Process. Systems
- Spectral dynamics for speech recognition under adverse conditions