Speech Communication

Volume 48, Issue 6, June 2006, Pages 746-758

Scalable distributed speech recognition using Gaussian mixture model-based block quantisation

https://doi.org/10.1016/j.specom.2005.10.002

Abstract

In this paper, we investigate the use of block quantisers based on Gaussian mixture models (GMMs) for the coding of Mel frequency-warped cepstral coefficient (MFCC) features in distributed speech recognition (DSR) applications. Specifically, we consider the multi-frame scheme, where temporal correlation across MFCC frames is exploited by the Karhunen–Loève transform of the block quantiser. Compared with vector quantisers, the GMM-based block quantiser has relatively low computational and memory requirements, which are independent of bitrate. More importantly, it is bitrate scalable, which means that the bitrate can be adjusted without the need for re-training. Static parameters such as the GMM and transform matrices are stored at the encoder and decoder, and bit allocations are calculated ‘on-the-fly’ without intensive processing. We have evaluated the quantisation scheme on the Aurora-2 database in a DSR framework. We show that jointly quantising more frames and using more mixture components in the GMM leads to higher recognition performance. The multi-frame GMM-based block quantiser achieves a word error rate (WER) of 2.5% at 800 bps, which represents less than 1% degradation from the baseline (unquantised) word recognition accuracy, and degrades gracefully to a WER of 7% at 300 bps.

Introduction

With the increase in popularity of remote and wireless devices such as personal digital assistants (PDAs) and cellular phones, there has been growing interest in applying automatic speech recognition (ASR) technology in the context of mobile communication systems. Speech recognition lets users perform common tasks that have traditionally been accomplished via buttons or pointing devices, such as making a call through voice dialing or entering data into a PDA via spoken commands and sentences. The issues that arise when implementing ASR on mobile devices include the computational and memory constraints of the mobile device, network bandwidth utilisation, and robustness to noisy operating conditions.

Mobile devices generally have limited storage and processing ability, which makes implementing a full on-board ASR system impractical. A solution to this problem is to perform the complex speech recognition task on a remote server that is accessible via the network. Various modes of this client–server approach have been proposed and reported in the literature. In the network speech recognition (NSR) mode (Kiss, 2000), the user’s speech is compressed using a conventional speech coder (such as the GSM speech coder) and transmitted to the server, which performs the recognition task. In speech-based NSR (Fig. 1(a)), the server calculates ASR features from the decoded speech to perform the recognition. In bitstream-based NSR (Fig. 1(b)), the server uses ASR features derived from linear predictive coding (LPC) parameters taken directly from the bitstream. Numerous studies have evaluated and compared the performance of these two forms of NSR (Hirsch, 1998, Huerta and Stern, 1998, Kim and Cox, 2001, Lilly and Paliwal, 1996, Raj et al., 2001, Turunen and Vlaj, 2001, Gallardo-Antolin et al., 1998).

In distributed speech recognition (DSR), shown in Fig. 1(c), the ASR system is distributed between the client and the server. Here, feature extraction is performed at the client. The ASR features are compressed and transmitted to the server via a dedicated channel, where they are decoded and passed to the ASR backend. Studies have shown that DSR generally performs better than NSR (Kiss, 2000) because, in the latter, speech is coded for optimal perceptual quality, which does not necessarily result in optimal recognition performance (Srinivasamurthy et al., 2003).

Various schemes for compressing ASR features have been proposed in the literature. Digalakis et al. (1999) evaluated the use of uniform and non-uniform scalar quantisers, as well as product code vector quantisers, for compressing Mel frequency-warped cepstral coefficients (MFCCs) between 1.2 and 10.4 kbps. They concluded that split vector quantisers achieved word error rates (WERs) similar to those of scalar quantisers while requiring fewer bits. Also, scalar quantisers with non-uniform bit allocation performed better than those with uniform bit allocation. Ramaswamy and Gopalakrishnan (1998) investigated the application of tree-searched multistage vector quantisers with one-step linear prediction operating at 4 kbps. Transform coding, based on the discrete cosine transform (DCT), was investigated by Kiss and Kapanen (1999) at 4.2 kbps and by Zhu and Alwan (2001), who used a two-dimensional DCT. The ETSI DSR standard (STQ, 2000) uses split vector quantisers to compress the MFCC vectors at 4.4 kbps. Srinivasamurthy et al. (2003) exploited the correlation across consecutive MFCC features by using a DPCM scheme followed by entropy coding.

Even though vector quantisers generally give better recognition performance with fewer bits, they are not scalable in bitrate, unlike scalar quantiser-based schemes such as DPCM and transform coders. In other words, a vector quantiser is designed to operate at a specific bitrate only and needs to be re-trained for other bitrates. Bitrate scalability is a desirable feature in DSR applications, since one may need to adjust the bitrate adaptively depending on the network conditions. For instance, if the communication network is heavily congested, it may be acceptable to sacrifice some recognition performance by operating at a lower bitrate in order to avoid long response times. In addition, the computational complexity of vector quantisers can be quite high compared with scalar quantiser-based schemes.

Block quantisation, or transform coding, has been used as a less complex alternative to full-search vector quantisation in the coding literature. Proposed by Kramer and Mathews (1971) and analysed by Huang et al. (1963), it involves decorrelating the components within a block or vector of samples using a linear transformation before scalar quantising each component independently. When quantising for minimum mean-squared error (MSE), the Karhunen–Loève transform (KLT) is the best transform for correlated Gaussian sources (Goyal, 2001). However, the probability density functions (PDFs) of real-life sources are rarely Gaussian, and any PDF mismatch with the quantiser will invariably cause a degradation in performance. Rather than assuming the source PDF to be a standard function such as Gaussian, Laplacian, etc., one can design a quantiser that matches the source PDF as closely as possible.
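
To make the decorrelate-then-scalar-quantise structure concrete, the following sketch (our illustration in Python, not code from the paper) computes the KLT from the sample covariance of the training blocks and applies a simple uniform scalar quantiser to each transform coefficient; the ±4 standard-deviation quantiser range, the function name and the per-component bit vector are assumptions made only for this example.

    import numpy as np

    def klt_block_quantise(X, bits_per_component):
        # X: (num_blocks, n) matrix of n-dimensional blocks.
        # bits_per_component: length-n integer sequence of bits per coefficient.
        mean = X.mean(axis=0)
        cov = np.cov(X - mean, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
        P = eigvecs[:, order]                      # KLT basis (n x n)
        sigma = np.sqrt(np.maximum(eigvals[order], 1e-12))
        Y = (X - mean) @ P                         # decorrelated coefficients
        Yq = np.empty_like(Y)
        for i, b in enumerate(bits_per_component):
            levels = 2 ** int(b)
            step = 8.0 * sigma[i] / levels         # uniform quantiser over +/- 4 sigma
            Yq[:, i] = np.clip(np.round(Y[:, i] / step),
                               -(levels // 2), levels // 2 - 1) * step
        return Yq @ P.T + mean                     # reconstructed blocks

Schemes of the kind studied in this paper typically use non-uniform (Gaussian) scalar quantisers rather than the uniform quantiser used here for brevity.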

There have been numerous studies in the coding literature on source PDF modelling for quantiser design. These can be broadly classified as either non-parametric or parametric approaches. Ortega and Vetterli (1996) estimated the source model in a non-parametric fashion using piecewise linear approximation. Similarly, multidimensional histograms were used by Gardner and Rao (1995) to model the PDFs of line spectral frequencies (LSFs) in order to evaluate the theoretical bounds of split vector quantisers. In relation to parametric modelling, Su and Mersereau (1996) applied Gaussian mixture models (GMMs) to the estimation of the PDF of DCT coefficients, while Archer and Leen (2004) used GMMs to form a probabilistic latent variable model from which transform coder design algorithms were derived. On the speech side, Hedelin and Skoglund (2000) used GMMs with bounded support for designing and evaluating high-rate vector quantisers for LSF coding, while Samuelsson and Hedelin (2001) extended this work to recursive spectral coding.

Subramaniam and Rao (2003) incorporated the PDF model into the block quantisation of LSFs via GMMs, and in our previous work (Paliwal and So, 2004a) we extended this scheme to exploit memory across successive frames. Even though this quantisation scheme does not perform as well as vector quantisation, it nevertheless possesses the following advantages (Subramaniam and Rao, 2003):

  • Compact representation of the source PDF which is independent of bitrate;

  • bitrate scalability with ‘on-the-fly’ bit allocation (a sketch of such an allocation rule is given after this list); and

  • low search complexity and memory requirements which are independent of the rate of the system.
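
As an illustration of how bit allocation can be computed on the fly from stored statistics (our own sketch of the classical high-resolution rule, not the exact procedure of Subramaniam and Rao (2003)), the following allocates a bit budget across transform coefficients in proportion to their log-variances; the greedy integer rounding is a simplification introduced here.

    import numpy as np

    def allocate_bits(variances, total_bits):
        # variances: positive KLT-domain variances (eigenvalues), one per coefficient.
        # total_bits: integer bit budget for the whole block.
        variances = np.asarray(variances, dtype=float)
        n = len(variances)
        geo_mean = np.exp(np.mean(np.log(variances)))
        b = total_bits / n + 0.5 * np.log2(variances / geo_mean)
        b = np.clip(b, 0.0, None)                  # no negative bit counts
        bi = np.floor(b).astype(int)
        while bi.sum() < total_bits:               # hand out leftover bits greedily
            bi[np.argmax(b - bi)] += 1
        while bi.sum() > total_bits:               # or trim any excess
            bi[np.argmax(bi)] -= 1
        return bi

Because only the stored variances are needed, changing the bitrate simply means re-running an allocation of this kind; no codebooks have to be re-trained.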

In this paper, we investigate the use of the fixed-rate, multi-frame GMM-based block quantisation scheme of Paliwal and So (2004a) for DSR applications. This scheme is computationally simpler than vector quantisation, is scalable in bitrate, and leads to a more graceful degradation in recognition performance than other scalar quantiser-based schemes.

The organisation of this paper is as follows. We give a brief description of the multi-frame GMM-based block quantiser, its bit allocation, and its computational and memory requirements in Section 2. In Section 3, we describe the setup of our recognition experiments on the Aurora-2 database. Following this, in Section 4, we present and discuss the recognition results for the single frame and multi-frame GMM-based block quantisers and compare them with those of the non-uniform scalar quantiser and the vector quantiser. Finally, in Section 5, we offer our conclusions and outline further work.

Section snippets

Multi-frame GMM-based block quantisation

This quantisation scheme is based on the one proposed by Subramaniam and Rao (2003) for quantising speech line spectral frequencies (LSFs), where a Gaussian mixture model (GMM) is used to parametrically model the probability density function (PDF) of the source and block quantisers are then designed for each Gaussian mixture component. In Paliwal and So (2004a), we proposed a modified scheme which used vectors formed by concatenating p successive frames, in order to exploit interframe correlation.
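
As a rough sketch of the training stage just described (our own illustration using NumPy and scikit-learn, not the authors' implementation; the choice of p = 5 frames and 16 mixture components is arbitrary), the code below fits the GMM with EM and derives one KLT per mixture component from its covariance matrix. Only the GMM parameters and the per-component KLT matrices need to be stored at the encoder and decoder, and they are independent of the operating bitrate.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_multiframe_gmm_block_quantiser(mfcc_frames, p=5, n_mix=16):
        # mfcc_frames: (num_frames, dim) matrix of MFCC feature vectors.
        n_frames, dim = mfcc_frames.shape
        usable = (n_frames // p) * p
        vectors = mfcc_frames[:usable].reshape(-1, p * dim)   # concatenate p frames

        gmm = GaussianMixture(n_components=n_mix, covariance_type='full')
        gmm.fit(vectors)                                       # EM training of the GMM

        klts, variances = [], []
        for cov in gmm.covariances_:                           # one KLT per component
            w, V = np.linalg.eigh(cov)
            order = np.argsort(w)[::-1]
            klts.append(V[:, order])
            variances.append(w[order])
        return gmm, klts, variances

At encode time, a concatenated vector would be quantised with the block quantiser of each mixture component (mean removal, KLT, scalar quantisation with that component's bit allocation) and the candidate with the lowest distortion retained, as in Subramaniam and Rao (2003).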

Experimental setup

We have evaluated the recognition performance of various quantisation schemes using the publicly available HTK 3.2 software on the ETSI Aurora-2 database (Hirsch and Pearce, 2000). The purpose of the Aurora-2 database is to provide a common framework for evaluating DSR-related issues in a connected-digit recognition task. It consists of a clean (i.e., noise-free) speech database together with test sets corrupted by additive noise at a range of signal-to-noise ratios.

Recognition performance of the single frame GMM-based block quantiser

Table 2 shows the recognition accuracy of the single frame GMM-based block quantiser at various bitrates and numbers of mixture components. At 2 kbps, the recognition accuracy is roughly the same as that of the unquantised scheme. Between 2 kbps and 800 bps, the recognition performance decreases gradually, and the schemes which use a larger number of mixture components maintain higher accuracy. This may be attributed to the more accurate modelling of the source PDF when more mixture components are used.

Conclusion and further work

In this paper, we have investigated the use of the multi-frame GMM-based block quantiser for quantising MFCC features in DSR applications. The strengths of this quantisation scheme are its computational simplicity compared with vector quantisers, its bitrate scalability, and its graceful degradation of recognition performance at very low bitrates through effective exploitation of the intraframe and interframe correlation of MFCC frames. Because the PDF model and transformation matrices are stored at both the encoder and decoder, bit allocations can be recalculated ‘on-the-fly’, so the bitrate can be adjusted without re-training.

Acknowledgements

We would like to thank the anonymous reviewers for their detailed comments, which have been extremely helpful in improving the clarity and quality of this paper.

References

  • Adami, A., Burget, L., Dupont, S., Garudadri, H., Grezl, F., Hermansky, H., Jain, P., Kajarekar, S., Morgan, N., ...
  • Archer, C., et al., 2004. A generalized Lloyd-type algorithm for adaptive transform coder design. IEEE Trans. Signal Process.
  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc.
  • Digalakis, V.V., et al., 1999. Quantization of cepstral parameters for speech recognition over the world wide web. IEEE J. Select. Areas Commun.
  • Gallardo-Antolin, A., Diaz-de-Maria, F., Valverde-Albacete, F., 1998. Recognition from GSM digital speech. In: Proc. ...
  • Gardner, W.R., et al., 1995. Theoretical analysis of the high-rate vector quantization of LPC parameters. IEEE Trans. Speech Audio Process.
  • Gersho, A., et al., 1992. Vector Quantization and Signal Compression.
  • Goyal, V.K., 2001. Theoretical foundations of transform coding. IEEE Signal Process. Mag.
  • Hedelin, P., et al., 2000. Vector quantization based on Gaussian mixture models. IEEE Trans. Speech Audio Process.
  • Hirsch, H.G., 1998. The influence of speech coding on recognition performance in telecommunication networks. In: Proc. ...
  • Hirsch, H.G., Pearce, D., 2000. The Aurora experimental framework for the performance evaluation of speech recognition ...
  • Huang, J.J.Y., et al., 1963. Block quantization of correlated Gaussian random variables. IEEE Trans. Commun. Syst.
  • Huerta, J.M., Stern, R.M., 1998. Speech recognition from GSM codec parameters. In: Proc. ICSLP 4, pp. ...
  • Jarvinen, K., Vainio, J., Kapanen, P., Honkanen, T., Haavisto, P., 1997. GSM enhanced full rate speech codec. In: Proc. ...
  • Juang, B.H., et al., 1987. On the use of bandpass liftering in speech recognition. IEEE Trans. Acoust., Speech, Signal Process.
  • Kim, H.K., et al., 2001. A bitstream-based front-end for wireless speech recognition on IS-136 communications system. IEEE Trans. Speech Audio Process.