Reduced feature-set based parallel CHMM speech recognition systems
Introduction
Speaker-independent speech recognition systems have many parameters to optimise during implementation. They must cope with vast uncertainties arising from the varied production behaviour of different speakers. Statistical approaches based on hidden Markov models (HMMs) have proved superior to other techniques at capturing and modelling the features that carry the spoken information. The HMM framework represents the speech signal as a variable-duration sequence of events called states [25]. The ability of the HMM to discriminate between acoustic classes is highly affected by the observation feature vectors, which can be regarded as abstract mappings of the highly redundant speech samples. The feature vectors need to be as short as possible in dimension, which implies redundancy removal, while retaining as much linguistic information as possible. The selected features must ensure fast training and recognition procedures as well as good acoustic class discrimination. Feature vectors have been widely investigated, and many designs have been proposed in pursuit of the goal of good abstraction and representation. Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors to represent the speech samples. Other types of features, different from the MFCCs, have also been introduced and show strengths in certain applications. No feature set can be declared the absolute best performer under all environmental conditions in automatic speech recognition (ASR) systems. One way to exploit the strengths of different feature sets is to combine them deliberately under a suitable paradigm. This combination can be done at several points within the ASR structure.
The features can be concatenated at the very beginning, in the feature-stream domain, and presented to a single general classifier, or left as independent streams and presented to separate classifiers. Likewise, the classifier outcomes can be merged and presented to a general HMM decoder, or kept separate and presented to independent HMM decoders. The two main questions that need to be answered in any multi-stream ASR system design are: which feature sets to stream, and where to merge? There is no agreed analytical procedure for answering these questions; the solutions are mainly heuristic. However, there are some trends in using statistical notions to guide the decisions. The conditional mutual information (CMI) has been used to predict which feature streams will merge most advantageously and which of the many possible merging strategies will be most successful; it addresses the first question. The CMI of the raw feature streams is supposed to help in deciding whether to merge them into one large stream, or to feed them separately into independent classifiers for later merging [7], [8]. However, the reported results of the CMI technique are not very encouraging.
An important property of feature streams is that combining a number of diverse streams often improves recognition performance, and the greatest benefits come from combinations of the most diverse features [30]. Different front-end structures can be used to maximise stream diversity. Combining perceptual linear predictive (PLP) features with modulation-filtered spectrogram (MSG) features improves the recognition rate significantly [9]. Indeed, even a small change in the feature-vector preparation procedure can improve the recognition rate. Billa et al. [4] found that combining nearly identical feature sets differing only in frame rate, set between 80 and 125 frames per second, was enough to introduce some decorrelation between the errors in the streams; the merged system performed significantly better than any of the component streams. A variable frame rate is also useful in single-stream ASR systems: the frame rate is increased for rapidly changing segments with relatively high energy, and reduced for steady-state segments [31].
The classifiers are either Gaussian mixture models (GMMs) or neural networks (NNs). Hybridising HMMs with NNs is widely used in single-stream continuous speech recognition systems [22], while HMM speech recognition systems typically use GMMs. NNs are becoming popular in the multi-stream paradigm owing to their potential for probability estimation and classification. Merging the streams after the classification stage (posterior merging), rather than concatenating features, improves the recognition rate one step further [7], [8], [26]. The classifier outcomes may also be merged and decorrelated first, then presented to the GMM of a classical HMM decoder for better recognition performance [9]. The multi-estimation notion is also applicable to NN-based systems: it has been shown that recognition performance can be improved by using the same feature sets to train two NNs with different initialisation points [19].
Another interesting multi-stream approach comes from the sub-band notion. Rather than deriving the probability streams from completely different acoustic representations, it is also possible to divide a single representation into disjoint regions across the spectrum. Each sub-band can then be used as the basis for a separate probability estimator. The outputs of these estimators can be combined, either by averaging the log posterior probabilities for each class, or by using more complex methods including multi-layer perceptrons or weighted combinations [5], [6]. More specifically, in the sub-band technique the whole frequency band of the speech signal is split into several sub-bands, and each sub-band is processed independently, usually by a hybrid HMM neural network model. The technique rests on an assumption of sub-band independence, which is not strictly valid, since in reality the sub-bands are dependent. The sub-band outcomes are then recombined at several stages during the utterance according to certain criteria. The main advantage of this approach is the robustness of the recogniser to selective narrow-band noise [21]. The technique has also been adopted in random field modelling to model the hidden states of the HMM [12].
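The recombination step above, averaging (possibly weighted) log posterior probabilities across sub-bands, can be sketched as follows. This is an illustrative Python sketch, not the cited systems' implementation; the function name and the two-band example posteriors are hypothetical:

```python
import numpy as np

def combine_subband_posteriors(posteriors, weights=None):
    """Combine per-sub-band class posteriors by (weighted) averaging
    of their log posteriors, then renormalising to a distribution.

    posteriors: list of arrays, each of shape (n_classes,), one per
                sub-band estimator (each summing to 1).
    weights:    optional per-sub-band weights; equal by default.
    """
    P = np.stack(posteriors)                    # (n_bands, n_classes)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    log_comb = weights @ np.log(P + 1e-12)      # weighted mean of log posteriors
    comb = np.exp(log_comb - log_comb.max())    # back to the linear domain
    return comb / comb.sum()                    # renormalise

# hypothetical outputs of two sub-band estimators for three classes
p = combine_subband_posteriors([np.array([0.7, 0.2, 0.1]),
                                np.array([0.6, 0.3, 0.1])])
```

Averaging in the log domain amounts to a geometric mean of the posteriors, which penalises classes that any single sub-band estimator considers unlikely.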
Multi-streaming has also been viewed from another perspective: the feature vectors are split into a specified number of sub-vectors, each processed by a different quantizer, and a vector of discrete values with the same length as the number of sub-vectors is presented to a discrete recogniser [28]. This system was tested with 9, 15, 24, and 39 sub-vectors, and it showed improved recognition rates compared with the conventional CHMM.
The multi-stream approach has also been investigated from the perspective of recognition in noisy environments, where it showed substantial improvement in performance under different noise sources [27].
In this paper, we treat the Mel coefficients and their first and second derivatives as three independent streams. These streams have some degree of dependency, as is obvious from the way they are produced; nevertheless, treating them as independent enhanced the recognition rate. The adopted feature vectors thus comprise 39 coefficients per observation (12 Mel coefficients and one power coefficient, with their first and second derivatives), equally divided between three streams. We then reduce the dimensionality of each stream, using the F-ratio technique as a figure of merit. This reduction leaves only 28 MFCCs per observation vector to be used in our ASR system, instead of the original 39 coefficients.
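The three-stream split described above can be sketched as follows; the helper name is hypothetical, and the random array merely stands in for real MFCC-derived features:

```python
import numpy as np

def split_streams(features):
    """Split 39-dimensional observation vectors (13 static coefficients,
    13 deltas, 13 delta-deltas) into three 13-dimensional streams.

    features: array of shape (n_frames, 39)
    returns:  three arrays, each of shape (n_frames, 13)
    """
    assert features.shape[1] == 39
    return features[:, :13], features[:, 13:26], features[:, 26:]

X = np.random.randn(100, 39)   # stand-in for 100 frames of real features
static, delta1, delta2 = split_streams(X)
```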
This paper is organised as follows. Section 2 briefly reviews some related feature-vector designs. Section 3 describes the F-ratio as a figure of merit for assessing the importance of features, and how it can be applied directly to the HMM parameters. Section 4 explains the parallel HMM notion and the application of dimensionality reduction. Section 5 evaluates the performance of ASR systems based on different paradigms. The conclusions are summarised in Section 6.
Feature vector design based on static and dynamic coefficients
Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors, representing each 10–50 ms window of speech samples, taken every 5–25 ms, by a single vector of a certain dimension. The window length and rate, as well as the feature-vector dimension, are decided according to the application task. For many applications the most effective components of the Mel-scale features are the first 12 coefficients (excluding the zero coefficient), which are also called
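The framing step described above can be sketched as follows, with illustrative values (a 25 ms Hamming window taken every 10 ms at a 16 kHz sampling rate) chosen from the quoted ranges; the function name and defaults are hypothetical:

```python
import numpy as np

def frame_signal(signal, fs=16000, win_ms=25, hop_ms=10):
    """Slice a speech signal into overlapping analysis windows.
    Each row of the result is one tapered window ready for the
    subsequent spectral analysis stage.
    """
    win = int(fs * win_ms / 1000)    # samples per window (400 here)
    hop = int(fs * hop_ms / 1000)    # samples between windows (160 here)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win]
                       for i in range(n_frames)])
    return frames * np.hamming(win)  # taper to reduce spectral leakage

frames = frame_signal(np.zeros(16000))   # one second of (silent) audio
```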
Dimensionality reduction based on the F-ratio figures
The F-ratio is a measure that can be used to evaluate the effectiveness of a particular feature. It has been widely used as a figure of merit for feature selection in speaker recognition applications [24], [29]. It is defined as the ratio of the between-class variance (B) to the within-class variance (W). In the context of feature selection for pattern classification, the F-ratio can be considered a strong catalyst for selecting the features that maximise the separation between different
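A minimal sketch of the F-ratio as a per-feature figure of merit, computed here from labelled sample sets rather than directly from the HMM parameters as in the approach above; the function name and the synthetic two-class data are hypothetical:

```python
import numpy as np

def f_ratio(classes):
    """Per-feature F-ratio: variance of the class means (B) over the
    mean within-class variance (W). Higher values mark features that
    separate the classes better.

    classes: list of arrays, each of shape (n_samples_k, n_features)
    """
    means = np.stack([c.mean(axis=0) for c in classes])    # class means
    B = means.var(axis=0)                                  # between-class variance
    W = np.mean([c.var(axis=0) for c in classes], axis=0)  # within-class variance
    return B / W

# two synthetic classes: feature 0 separates them, feature 1 does not
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 2)) + np.array([5.0, 0.0])
b = rng.standard_normal((500, 2)) + np.array([-5.0, 0.0])
r = f_ratio([a, b])   # r[0] should dwarf r[1]
```

Ranking features by `r` and keeping the top-scoring ones is the dimensionality-reduction step that leaves 28 of the original 39 coefficients.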
Parallel HMM multi-streams-based system
We developed this system based on the multi-stream notion, targeting the advantages of alleviating the dominance problem, dimensionality reduction, and design flexibility [3]. The selected speech-signal features are the power and Mel-scale coefficients with their first and second derivatives (deltas and delta-deltas). This selection is due to the high potential of these coefficients for carrying the static and temporal information of the spoken signals. The first derivative can be approximated by
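A common approximation of the first derivative is the regression formula d_t = Σ_n n(c_{t+n} − c_{t−n}) / (2 Σ_n n²) over a window of ±N frames; the sketch below is one plausible realisation of that formula, not necessarily the exact variant used here:

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients by the standard regression formula
    d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2), with the
    first and last frames repeated at the edges. Applying it twice
    yields the second derivatives (delta-deltas).

    features: (n_frames, n_coeffs) array of static coefficients.
    """
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return d / denom

ramp = np.arange(10.0).reshape(-1, 1)   # static coefficients with slope 1
d = delta(ramp)                          # interior deltas should equal 1
```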
Comparative studies of different ASR system paradigms
In this section, we report the recognition rates of several successful ASR systems. The feature vectors used in all the systems consist of 28 MFCCs, in the proportions specified in the previous section. For simplicity, the paradigms are denoted as follows:
- ASR-1:
is a single-stream multi-mixture CHMM with nine states and five mixtures.
- ASR-2:
is a multi-stream ASR with equal stream-merging weights. Three streams are used, each modelled by a five-mixture CHMM with nine states.
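In a multi-stream CHMM, the per-stream state emission log-likelihoods are typically merged with exponent weights, log b_j(o) = Σ_s γ_s log b_js(o_s); the equal weighting in ASR-2 corresponds to γ_s = 1 for all three streams. The helper below is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def merge_stream_log_likelihoods(stream_loglik, weights=None):
    """Merge per-stream state emission log-likelihoods with exponent
    weights: log b_j(o) = sum_s gamma_s * log b_js(o_s).

    stream_loglik: (n_streams, n_states) array of log b_js(o_s).
    weights:       per-stream weights gamma_s; all ones by default
                   (equal stream-merging weights).
    """
    if weights is None:
        weights = np.ones(stream_loglik.shape[0])
    return weights @ stream_loglik   # (n_states,) merged log-likelihoods

# two hypothetical streams scoring two states
ll = merge_stream_log_likelihoods(np.log(np.array([[0.5, 0.1],
                                                   [0.4, 0.2]])))
```

With equal unit weights the merged likelihood is simply the product of the per-stream likelihoods, i.e. the streams are treated as independent, mirroring the independence assumption discussed above.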
Conclusions
In this paper, we have investigated the problem of improving speech recognition performance by restructuring the way the feature vectors are used. Instead of treating the composite static and dynamic MFCC-based speech-signal features as a single stream, we proposed splitting them into three independent streams. Despite the fact that the streams' independence assumption is not very precise, given the way their vectors are derived, the multi-streams paradigm has outperformed
Acknowledgements
The authors would like to thank Professor Garry Tee for his constructive comments on this paper. This research is funded by The University of Auckland Research Fund under project number UARF 3602239/9273 and FRST of New Zealand, grant NERF AUT02/001.
References (31)
- Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing (1992)
- W.H. Abdulla, N.K. Kasabov, Speech recognition enhancement via robust CHMM speech background discrimination, in: Proc. ...
- W.H. Abdulla, N.K. Kasabov, Two pass hidden Markov model for speech recognition systems, in: Proceedings of ICICS'99, ...
- W.H. Abdulla, N.K. Kasabov, Feature selection for parallel CHMM speech recognition systems, in: Proceedings of the ...
- J. Billa, T. Colhurst, et al., Recent experiments in large vocabulary conversational speech recognition, in: ...
- et al., New approaches towards robust and adaptive speech recognition
- et al., Multi-stream Speech Recognition (1996)
- D.P. Ellis, Improved recognition by combining different features and different systems, in: Proceedings of AVIOS-2000, ...
- D.P. Ellis, Using mutual information to design feature combinations, in: Proceedings of the ICSLP-2000, Beijing, ...
- D.P. Ellis, R. Singh, et al., Tandem acoustic modelling in large-vocabulary recognition, in: Proceedings of the IEEE ...
- Speaker independent isolated word recognition using dynamic features of speech recognition, IEEE Trans. ASSP
- Markov random field modelling for speech recognition, Aust. J. Intell. Inform. Process. Systems
- Spectral dynamics for speech recognition under adverse conditions