
Information Sciences

Volume 156, Issues 1–2, 1 November 2003, Pages 21-38

Reduced feature-set based parallel CHMM speech recognition systems

https://doi.org/10.1016/S0020-0255(03)00162-2

Abstract

This paper presents the multi-stream paradigm as a technique for improving speech signal feature-set design and as a performance booster for speech recognition systems based on the continuous-density hidden Markov model (CHMM) framework. In the multi-stream paradigm, different feature sets are used independently to estimate the same task, and their results are combined at a suitable stage. This paradigm combines the strengths of varied feature vectors to attain better statistical estimation. Under the proposed paradigm the feature vectors are split into three independent streams, and each stream is used to train an independent CHMM. The outcomes of these models, for any given speech input, are then merged under a specified strategy. This technique alleviates the dominance effect of some features and reduces the dimensionality of the feature vectors used in each model. The F-ratio technique is used to further reduce the dimensionality of each stream. Experimental results on different datasets show the superiority of the developed paradigm over the corresponding single-stream baseline.

Introduction

Speaker-independent speech recognition systems have many parameters to optimise during implementation. There are vast uncertainties to deal with, arising from the varied production behaviour of different speakers. Statistical approaches using HMMs are superior to other techniques in capturing and modelling the features that carry the spoken information. The HMM framework interprets a speech signal as a variable-duration sequence of events called states [25]. The ability of an HMM to discriminate between acoustic classes is strongly affected by the observation feature vectors, which are abstract mappings of the highly redundant speech samples. The feature vectors need to be as short as possible, which implies redundancy removal, while containing as much linguistic information as possible. The selected features must allow fast training and recognition procedures, as well as good acoustic class discrimination. Feature vectors have been widely investigated, and many designs have been proposed to reach the goal of good abstraction and representation. Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors to represent the speech samples. Other types of features, different from the MFCCs, have also been introduced and show strengths in certain applications. No feature set can be declared the absolute best performer under all environmental conditions in automatic speech recognition (ASR) systems. One way to exploit the strengths of the different feature sets is to combine them deliberately under a suitable paradigm. The combination can be done at several points within the ASR structure. The features can be concatenated at the very first stage, in the feature-stream domain, and presented to a single classifier, or left as independent streams and presented to separate classifiers. Likewise, the classifier outcomes can be merged and then presented to a single HMM decoder, or left separate and presented to separate HMM decoders. The two main questions that need to be answered in any multi-stream ASR system design are: which feature sets to stream, and where to merge? There is no agreed analytical procedure to answer these questions; the solutions are mainly heuristic. However, there are some trends in using statistical notions to guide the decisions. The conditional mutual information (CMI) has been used to predict which feature streams will merge most advantageously and which of the many possible merging strategies will be most successful, thereby answering the first question. The CMI of the raw feature streams is supposed to help in deciding whether to merge them into one large stream or to feed them separately into independent classifiers for later merging [7], [8]. However, the reported results of the CMI technique are not very encouraging.
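As an illustration of how such a statistical criterion might be computed, the sketch below estimates the conditional mutual information I(X; Y | C) between two scalar feature streams given the class labels, using simple histogram (plug-in) probability estimates. This is only a minimal illustration under our own assumptions; the function name, binning scheme, and scalar-stream restriction are not taken from [7], [8].

```python
import numpy as np

def conditional_mutual_information(x, y, c, bins=8):
    """Plug-in estimate of I(X; Y | C) for two scalar feature streams x, y
    (one value per frame) and discrete class labels c (hypothetical helper)."""
    # Discretise each stream into equal-width bins.
    x_d = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    y_d = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
    cmi = 0.0
    for cls in np.unique(c):
        mask = (c == cls)
        p_c = mask.mean()                       # P(C = cls)
        joint, _, _ = np.histogram2d(x_d[mask], y_d[mask], bins=bins)
        joint = joint / joint.sum()             # P(X, Y | C = cls)
        px = joint.sum(axis=1, keepdims=True)   # P(X | C = cls)
        py = joint.sum(axis=0, keepdims=True)   # P(Y | C = cls)
        nz = joint > 0
        cmi += p_c * np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz]))
    return cmi
```

A low CMI between two streams, given the class, would suggest that they carry complementary information and are good candidates for separate classifiers.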

The important property of feature streams is that combining a number of diverse streams often improves recognition performance, and the greatest benefits come from combinations of the most diverse features [30]. Different front-end structures can be used to maximise feature-stream diversity. Combining perceptual linear predictive (PLP) features with modulation-filtered spectrogram (MSG) features improves the recognition rate significantly [9]. In fact, even a modest change in the feature vector preparation procedure can lead to an improvement in the recognition rate when the resulting streams are combined. Billa et al. [4] found that combining nearly identical sets of features differing only in frame rate, which was set between 80 and 125 frames per second, was enough to introduce some decorrelation between the errors in the streams. The merged system performed significantly better than any one of the component streams. A variable frame rate is also useful in single-stream ASR systems; in that case, the frame rate is increased for rapidly changing segments with relatively high energy, and reduced for steady-state segments [31].

The classifiers used in these systems are either Gaussian mixture models (GMMs) or neural networks (NNs). Hybridising HMMs with NNs is widely used in single-stream continuous speech recognition systems [22], while HMM speech recognition systems typically use GMMs. NNs are becoming popular in the multi-stream paradigm because of their strength in estimating probability functions and performing classification. Merging the streams after the classification stage (posterior merging), rather than concatenating features, improves the recognition rate one step further [7], [8], [26]. The classifier outcomes may also be merged and decorrelated first, then presented to the GMMs of a classical HMM decoder for better recognition performance [9]. The multi-estimation notion is also applicable to NN-based systems: it has been shown that recognition performance can be improved by training two NNs on the same feature sets but from different initialisation points [19].

Another interesting approach in multi-stream research comes from the sub-band notion. Rather than deriving the probability streams from completely different acoustic representations, it is also possible to divide a single representation into disjoint regions across the spectrum. Each sub-band is then used as the basis for a separate probability estimator. The outputs of these estimators can be combined, either by averaging the log posterior probabilities for each class, or by more complex methods including multi-layer perceptrons or weighted combinations [5], [6]. More specifically, in the sub-band technique the whole frequency band of the speech signal is split into several sub-bands, each of which is processed independently, mostly by a hybrid HMM/neural network model. The technique is based on an assumption of sub-band independence, which does not strictly hold, since in reality the sub-bands are dependent. The sub-band outcomes are then recombined at several stages during the utterance according to certain criteria. The main advantage of this approach is the robustness of the recogniser to selective narrow-band noise [21]. This technique has also been adopted in random field modelling of the hidden states of HMMs [12].
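For concreteness, a minimal sketch of the simplest recombination rule mentioned above, averaging the log posterior probabilities produced by the per-band estimators, is given below. The function name and equal default weights are our assumptions; more elaborate combiners (MLPs, weighted schemes) would replace the averaging step.

```python
import numpy as np

def merge_band_posteriors(band_posteriors, weights=None):
    """Combine per-frame class posteriors from several sub-band estimators by
    (weighted) averaging of their log posteriors, then renormalising.
    band_posteriors: list of arrays, each of shape (n_frames, n_classes)."""
    logs = [np.log(np.clip(p, 1e-12, None)) for p in band_posteriors]
    if weights is None:
        weights = np.full(len(logs), 1.0 / len(logs))   # equal weights
    merged = sum(w * lp for w, lp in zip(weights, logs))
    merged = np.exp(merged - merged.max(axis=1, keepdims=True))
    return merged / merged.sum(axis=1, keepdims=True)   # back to probabilities
```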

Multi-streaming has also been viewed from another perspective: the feature vectors are split into a specified number of sub-vectors, which are then processed by different quantisers, and a vector of discrete values with the same length as the number of sub-vectors forms the input to a discrete recogniser [28]. This system was tested with 9, 15, 24, and 39 sub-vectors, and it showed an improvement in recognition rate compared with the conventional CHMM.
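A rough sketch of this sub-vector quantisation step is given below, under our own assumptions about the quantiser (k-means) and codebook size; reference [28] may use a different scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantise_subvectors(features, n_sub, codebook_size=64):
    """Split each feature vector into n_sub contiguous sub-vectors, train a
    separate k-means codebook per sub-vector, and return the discrete symbol
    sequence (one symbol per sub-vector per frame) for a discrete recogniser."""
    parts = np.array_split(features, n_sub, axis=1)
    symbols = []
    for part in parts:
        km = KMeans(n_clusters=codebook_size, n_init=4, random_state=0).fit(part)
        symbols.append(km.labels_)
    return np.stack(symbols, axis=1)   # shape (n_frames, n_sub)
```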

The multi-stream approach has also been investigated from the perspective of recognition in noisy environments, where it showed substantial improvement in recognition performance under different noise sources [27].

In this paper, we deal with the Mel-scale coefficients and their first and second derivatives as three independent streams. These streams have some degree of dependency, as is obvious from the way they are produced. However, they improved the recognition rate when treated as independent. Thus, the adopted feature vectors comprise 39 coefficients (12 Mel coefficients and one power coefficient, together with their first and second derivatives) per observation, equally divided among three streams. We then reduce the dimensionality of each stream, using the F-ratio as a figure of merit. This reduction leaves only 28 MFCCs per observation vector to be used in our ASR system, instead of the original 39 coefficients.
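The stream split can be sketched as follows, assuming the common ordering of the 39-dimensional vector as [statics | first derivatives | second derivatives]; the ordering and function name are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def split_into_streams(obs):
    """Split 39-dimensional observation vectors (12 MFCCs + power, with first
    and second derivatives) into the three 13-dimensional streams.
    obs: array of shape (n_frames, 39), assumed ordered [static | delta | delta-delta]."""
    assert obs.shape[1] == 39
    static = obs[:, 0:13]    # 12 MFCCs + power coefficient
    delta  = obs[:, 13:26]   # first derivatives
    accel  = obs[:, 26:39]   # second derivatives
    return static, delta, accel
```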

This paper is organised as follows. Section 2 briefly reviews some related feature vector designs. Section 3 describes the F-ratio as a figure of merit for assessing the importance of features, and shows how it can be applied directly to the HMM parameters. Section 4 explains the parallel HMM notion and the dimensionality-reduction application. Section 5 evaluates the performance of ASR systems based on different paradigms. Section 6 summarises the conclusions.

Section snippets

Feature vector design based on static and dynamic coefficients

Current approaches rely mainly on the successful Mel frequency cepstral coefficient (MFCC) vectors to represent each 10–50 ms window of speech samples, taken every 5–25 ms, by a single vector of a certain dimension. The window length and rate, as well as the feature vector dimension, are decided according to the application task. For many applications the most effective components of the Mel-scale features are the first 12 coefficients (excluding the zero coefficient), which are also called
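The preview text breaks off above. As an illustration of such a front end, the sketch below computes MFCC vectors over roughly 25 ms windows taken every 10 ms (values within the ranges quoted above) using librosa; the parameter choices and function name are ours, not the paper's.

```python
import librosa  # any MFCC front end would do; librosa is assumed here

def mfcc_frames(wave, sr=16000, win_ms=25, hop_ms=10, n_mfcc=13):
    """Compute MFCC vectors over win_ms windows taken every hop_ms,
    returning an array of shape (n_frames, n_mfcc)."""
    n_fft = int(sr * win_ms / 1000)
    hop_length = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T
```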

Dimensionality reduction based on the F-ratio figures

The F-ratio is a measure that can be used to evaluate the effectiveness of a particular feature. It has been widely used as a figure of merit for feature selection in speaker recognition applications [24], [29]. It is defined as the ratio of the between-class variance (B) to the within-class variance (W). In the context of feature selection for pattern classification, the F-ratio can be considered a strong catalyst for selecting the features that maximise the separation between different
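The snippet is cut off here. A minimal per-dimension F-ratio estimate, computed directly from labelled feature vectors rather than from HMM parameters as Section 3 describes, might look like the following sketch; the function name and estimator details are our own. Dimensions with the highest F-ratios would then be retained in each stream.

```python
import numpy as np

def f_ratio(features, labels):
    """Per-dimension F-ratio: variance of the class means (between-class, B)
    divided by the average within-class variance (W).
    features: (n_samples, n_dims); labels: (n_samples,)."""
    classes = np.unique(labels)
    class_means = np.array([features[labels == c].mean(axis=0) for c in classes])
    class_vars  = np.array([features[labels == c].var(axis=0)  for c in classes])
    between = class_means.var(axis=0)   # B
    within  = class_vars.mean(axis=0)   # W
    return between / within
```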

Parallel HMM multi-streams-based system

We developed this system based on the multi-stream notion, targeting the advantages of alleviating the dominance problem, reducing dimensionality, and adding flexibility to the design [3]. The selected speech signal feature vectors are the power and Mel-scale coefficients with their first and second derivatives (the delta coefficients). This selection is due to the high potential of these coefficients in carrying the static and temporal information of the spoken signal. The first derivative can be approximated by
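The snippet is truncated at this point; one widely used regression-style approximation of the first derivative is sketched below as an assumption on our part, since the paper's exact formula is cut off here. The second derivative can be obtained by applying the same operation to the delta coefficients.

```python
import numpy as np

def delta(coeffs, N=2):
    """Regression approximation of the first time derivative of feature
    trajectories: d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2).
    coeffs: (n_frames, n_dims); boundary frames are repeated at the edges."""
    T = coeffs.shape[0]
    padded = np.pad(coeffs, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(coeffs, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom
```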

Comparative studies of different ASR system paradigms

In this section, we report the recognition rates of several successful ASR system paradigms. The feature vectors used in all the systems consist of 28 MFCCs, in the proportions specified in the previous section. For simplicity, the paradigms are denoted by the following names:

  • ASR-1: a single-stream, multi-mixture CHMM with nine states and five mixtures.

  • ASR-2: a multi-stream ASR with equal stream-merging weights; three streams are used, each modelled by a nine-state, five-mixture CHMM (a merging sketch follows this list).
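A sketch of the equal-weight merging used by ASR-2 follows. The per-stream scoring interface (`score_fn`) is hypothetical; in the actual system each stream is decoded by its own CHMM and the resulting log-likelihoods are combined.

```python
import numpy as np

def merged_score(stream_loglikes, weights=None):
    """Combine per-stream log-likelihoods for one utterance against one word's
    set of stream models; ASR-2 uses equal weights."""
    scores = np.asarray(stream_loglikes, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

def recognise(utterance_streams, word_models, score_fn):
    """Pick the word whose stream models give the best merged score.
    word_models: dict mapping word -> list of per-stream CHMMs;
    score_fn(model, stream) -> log-likelihood (hypothetical interface)."""
    best_word, best_score = None, -np.inf
    for word, models in word_models.items():
        s = merged_score([score_fn(m, x) for m, x in zip(models, utterance_streams)])
        if s > best_score:
            best_word, best_score = word, s
    return best_word
```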

Conclusions

In this paper, we have investigated the problem of improving speech recognition performance by restructuring the way the feature vectors are used. Instead of treating the composite static and dynamic MFCC-based speech signal features as a single stream, we proposed splitting them into three independent streams. Although the streams' independence assumption is not strictly accurate, given the way their vectors are derived, the multi-stream paradigm has outperformed

Acknowledgements

The authors would like to thank Professor Garry Tee for his constructive comments on this paper. This research is funded by The University of Auckland Research Fund under project number UARF 3602239/9273 and FRST of New Zealand, grant NERF AUT02/001.

References (31)

  • K.K. Paliwal, Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing (1992).
  • W.H. Abdulla, N.K. Kasabov, Speech recognition enhancement via robust CHMM speech background discrimination, in: Proc....
  • W.H. Abdulla, N.K. Kasabov, Two pass hidden Markov model for speech recognition systems, in: Proceedings of ICICS’99,...
  • W.H. Abdulla, N.K. Kasabov, Feature selection for parallel CHMM speech recognition systems, in: Proceedings of the...
  • J. Billa, T. Colhurst, et al., Recent experiments in large vocabulary conversational speech recognition, in:...
  • H. Bourlard et al., New approaches towards robust and adaptive speech recognition.
  • H. Bourlard et al., Multi-stream Speech Recognition (1996).
  • D.P. Ellis, Improved recognition by combining different features and different systems, in: Proceedings of AVIOS-2000,...
  • D.P. Ellis, Using mutual information to design feature combinations, in: Proceedings of the ICSLP-2000, Beijing,...
  • D.P. Ellis, R. Singh, et al., Tandem acoustic modelling in large-vocabulary recognition, in: Proceedings of the IEEE...
  • S. Furui, Speaker independent isolated word recognition based on emphasized spectral dynamics, in: Proceedings of the...
  • S. Furui, Speaker-independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. ASSP (1986).
  • G. Gravier et al., Markov random field modelling for speech recognition, Aust. J. Intell. Inform. Process. Systems (1998).
  • V.N. Gupta, M. Lenning, et al., Integration of acoustic information in a large vocabulary word recognizer, in:...
  • B.A. Hanson et al., Spectral dynamics for speech recognition under adverse conditions.