
Speech Communication

Volume 47, Issue 3, November 2005, Pages 265-276

Multi-frame GMM-based block quantisation of line spectral frequencies

https://doi.org/10.1016/j.specom.2005.02.007

Abstract

In this paper, we investigate the use of the Gaussian mixture model (GMM)-based block quantiser for coding line spectral frequencies, using multiple frames and mean squared error as the quantiser selection criterion. As a viable alternative to vector quantisers, the GMM-based block quantiser offers low computational and memory requirements as well as bitrate scalability. Jointly quantising multiple frames exploits the correlation across successive frames, which leads to more efficient block quantisation. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion. These distortion performance gains come at the cost of increased computational complexity and memory. Experiments on narrowband speech from the TIMIT database demonstrate that the multi-frame GMM-based block quantiser can achieve a spectral distortion of 1 dB at 22 bits/frame, or 21 bits/frame with some added complexity.

Introduction

Linear predictive coding (LPC) of speech requires the accurate quantisation of parameters representing the spectral envelope. Speech is windowed into frames and the spectral envelope is parametrically modelled as an all-pole filter, whose coefficients are called LPC parameters. These LPC parameters are generally quantised in terms of line spectral frequencies (LSFs) using a vector quantiser (VQ). Extrapolating from the operating curve of full search VQ suggests that about 19 bits/frame are needed to achieve transparent coding of these parameters (Paliwal and Kleijn, 1995), while high-rate analysis predicts a lower bound of 23 bits/frame (Hedelin and Skoglund, 2000). Designing codebooks at these rates is not feasible and, in addition, the computational cost of the resulting full search vector quantiser is very high.

Less complex but suboptimal vector quantisers such as multistage and split VQ have been investigated in the speech coding literature (LeBlanc et al., 1993, Paliwal and Atal, 1993), where it was generally observed that 22–24 bits/frame were required to achieve transparent coding of speech, with varying degrees of complexity. Further gains in performance can be achieved by exploiting temporal correlation between successive frames. Matrix quantisation (Tsao and Gray, 1985) and its derivatives such as split matrix quantisation (Xydeas and Papanastasiou, 1999) and multi-mode matrix quantisation (Nurminen et al., 2003, Sinervo et al., 2003) perform better by jointly quantising LSF frames.

The use of Gaussian mixture models (GMM) for the coding of LSFs has been investigated in (Hedelin and Skoglund, 2000, Shabestary and Hedelin, 2002, Subramaniam and Rao, 2000, Subramaniam and Rao, 2001, Subramaniam and Rao, 2003). In (Subramaniam and Rao, 2003), a GMM is used to parameterise the probability density function (PDF) of the source and optimised Gaussian block quantisers are designed for each cluster (or mixture component). Using this quantiser in its fixed rate mode, a spectral distortion of approximately 1 dB was achieved at 24 bits/frame. The main advantages of this scheme over vector quantisers include (Subramaniam and Rao, 2003):

  1. lower complexity through the use of block quantisers;
  2. bitrate scalability; and
  3. search complexity and memory requirements that are independent of the rate of the system.
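In outline, a fixed-rate GMM-based block quantiser encodes each input vector with every cluster's quantiser and keeps the one that reproduces it best. The sketch below is purely illustrative (the cluster parameter layout, uniform scalar quantiser, and step sizes are assumptions, not the authors' exact design); it uses minimum MSE as the selection rule, which is the criterion this paper argues for:

```python
import numpy as np

def encode(x, clusters, levels=16):
    """Pick the cluster quantiser that reproduces x best (minimal sketch).

    clusters: list of (mean, eigvecs, step) per Gaussian mixture component,
    assumed precomputed offline; 'step' is a scalar quantiser step size.
    All names and structures here are illustrative.
    """
    best = None
    for m, (mean, V, step) in enumerate(clusters):
        z = V.T @ (x - mean)                        # rotate into KLT domain
        idx = np.clip(np.round(z / step), -levels // 2, levels // 2 - 1)
        x_hat = V @ (idx * step) + mean             # local reconstruction
        err = np.sum((x - x_hat) ** 2)              # MSE selection criterion
        if best is None or err < best[0]:
            best = (err, m, idx)
    return best[1], best[2]                         # winning cluster, indices
```

The decoder only needs the transmitted cluster index and quantiser indices to invert the same per-cluster transform, which is why search cost scales with the number of clusters rather than the bitrate.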

A modified quantiser with memory was also described in (Subramaniam and Rao, 2003) that coded the difference between successive frames, similar to differential pulse code modulation (DPCM) with a one-tap predictor. A spectral distortion of 1 dB was achieved at 22 bits/frame (Subramaniam and Rao, 2003). During the coding process, the spectral distortion (SD) calculation is used frequently for cluster quantiser selection. While approximate high-rate expressions exist for the spectral distortion (Gardner and Rao, 1995), its computational cost is still considerably higher than that of the mean squared error (MSE).
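The cost gap between the two selection criteria is easy to see in code: SD needs an FFT per candidate LPC model, while MSE is a single vector difference. A minimal numpy sketch of both measures (for simplicity it evaluates the full-band LPC envelope on an FFT grid; practical SD figures are usually computed over a limited band such as 0–3 kHz):

```python
import numpy as np

def spectral_distortion(a, a_hat, n_fft=512):
    """Spectral distortion (dB) between two LPC envelopes 1/|A(f)|^2.

    a, a_hat: LPC coefficient vectors [1, a1, ..., an].
    Requires an FFT per model, hence the higher cost versus MSE.
    """
    P = 1.0 / np.abs(np.fft.rfft(a, n_fft)) ** 2        # envelope of model a
    P_hat = 1.0 / np.abs(np.fft.rfft(a_hat, n_fft)) ** 2
    diff_db = 10.0 * np.log10(P / P_hat)                # log-spectral difference
    return np.sqrt(np.mean(diff_db ** 2))               # RMS over the grid

def mse(x, x_hat):
    """Mean squared error between two LSF vectors: no transform needed."""
    return np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
```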

In this paper, we investigate a modified version of the fixed-rate GMM-based block quantiser that operates on multiple frames and uses the mean squared error (MSE) distortion criterion. We have found this system to perform better, in terms of spectral distortion, than both the single-frame and the predictive quantisers of (Subramaniam and Rao, 2003).

The organisation of this paper is as follows. Section 2 introduces some preliminaries such as the line spectral frequency representation of LPC parameters and distortion measures that are commonly used in speech coding. In Section 3, we describe the operation of the multi-frame GMM-based block quantiser as well as its computational and memory requirements. Section 4 details the LPC analysis method and speech database that we have used to evaluate the performance of the quantiser. Section 5 discusses the performance of the multi-frame GMM-based block quantiser and how it compares with other quantisation schemes. Finally, we conclude in Section 6.

Section snippets

LSF representation of LPC coefficients

In the LPC analysis of speech, a short segment of speech is assumed to be the output of an all-pole filter, H(z) = 1/A(z), driven by white Gaussian noise, where A(z) is the inverse filter given by (Paliwal and Atal, 1993):

A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_n z^{-n}

Here, n is the order of LPC analysis and {a_i}_{i=1}^{n} are the LPC coefficients. Because H(z) is used to reconstruct speech in linear predictive speech coders, its stability is of utmost importance and cannot be ensured when LPC coefficients are coded
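The standard route to LSFs, and to the stability guarantee they provide, is to split A(z) into a symmetric polynomial P(z) = A(z) + z^{-(n+1)}A(z^{-1}) and an antisymmetric polynomial Q(z) = A(z) - z^{-(n+1)}A(z^{-1}); for a stable H(z) their roots lie on the unit circle and interlace. A minimal numpy sketch (root-finding via np.roots is the simplest approach, not the most numerically robust one used in real coders):

```python
import numpy as np

def lpc_to_lsf(a):
    """LSFs (radians, ascending) from LPC coefficients a = [1, a1, ..., an]."""
    a = np.asarray(a, dtype=float)
    # P(z) and Q(z): add/subtract the time-reversed, delayed coefficient vector
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        # keep one angle per conjugate pair, dropping the trivial roots at z = +/-1
        # (the tolerance is a crude guard; robust coders track roots differently)
        angles.extend(w[(w > 1e-4) & (w < np.pi - 1e-4)])
    return np.sort(np.array(angles))
```

Checking that a quantised LSF vector is strictly ascending (and re-ordering or spacing it if not) is all the decoder needs to guarantee a stable reconstructed H(z), which is the property that makes LSFs attractive for quantisation.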

Multi-frame GMM-based block quantisation

The multi-frame GMM-based block quantiser is based on the memoryless version proposed by Subramaniam and Rao (2003) for the coding of speech line spectral frequencies (LSF), where a Gaussian mixture model (GMM) is used to parametrically model the probability density function (PDF) of the source and block quantisers are then designed for each Gaussian mixture component (or, cluster). This modified scheme exploits interframe correlation by concatenating p successive frames into a larger vector.
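The two preprocessing steps, concatenating p frames into supervectors and decorrelating them with a Karhunen-Loeve transform, can be sketched as follows. This is a simplified single-transform view under assumed shapes (the actual scheme estimates a separate mean, covariance, and KLT per mixture component):

```python
import numpy as np

def stack_frames(lsf, p):
    """Concatenate p successive n-dim LSF frames into (n*p)-dim supervectors."""
    n_frames = (len(lsf) // p) * p          # drop the ragged tail
    return lsf[:n_frames].reshape(-1, p * lsf.shape[1])

def klt(vectors):
    """Karhunen-Loeve transform: rotate onto the covariance eigenvectors.

    Returns the decorrelated coefficients and the per-component variances
    (eigenvalues, descending), which drive the bit allocation.
    """
    mean = vectors.mean(axis=0)
    cov = np.cov(vectors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder by descending variance
    return (vectors - mean) @ eigvecs[:, order], eigvals[order]
```

Because adjacent frames are correlated, the eigenvalue spectrum of the stacked vectors decays faster than that of single frames, so more bits can be concentrated on the few high-variance components, which is the source of the multi-frame coding gain described above.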

Experimental setup

The TIMIT database was used to train and test the various quantisation schemes. It consists of speech down-sampled to 8 kHz with a 3.4 kHz anti-aliasing filter applied. A 20 ms Hamming window is used and a tenth order linear predictive analysis is performed on each frame using the autocorrelation method (Paliwal and Kleijn, 1995). There is no overlap between successive speech frames. High frequency compensation and bandwidth expansion of 15 Hz
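Under the stated setup (20 ms Hamming window, tenth-order autocorrelation-method LPC, 15 Hz bandwidth expansion), the analysis step might look like the following sketch; it uses the Levinson-Durbin recursion, and parameter names are illustrative:

```python
import numpy as np

def lpc_autocorr(frame, order=10, bw_hz=15.0, fs=8000.0):
    """LPC via the autocorrelation method with Levinson-Durbin recursion."""
    x = frame * np.hamming(len(frame))                     # taper the frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.array([1.0])                                    # A(z) so far
    err = r[0]                                             # prediction error
    for i in range(1, order + 1):
        k = -np.dot(a, r[i:0:-1]) / err                    # reflection coeff
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        err *= (1.0 - k * k)
    # bandwidth expansion: scale a_i by gamma^i, pulling poles off the circle
    gamma = np.exp(-np.pi * bw_hz / fs)
    return a * gamma ** np.arange(order + 1)
```

The bandwidth-expansion factor gamma = exp(-pi * 15 / 8000) ≈ 0.994 widens each formant bandwidth by roughly 15 Hz, a common safeguard against sharp spectral peaks before quantisation.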

Spectral distortion performance of the 16 cluster multi-frame GMM-based block quantiser

Table 3 shows the spectral distortion performance of the 16 cluster, multi-frame GMM-based block quantiser for varying bitrates and number of concatenated frames, p. A spectral distortion of 1 dB has been achieved at 22 bits/frame with p = 3. For any given bitrate, the spectral distortion decreases as more frames are concatenated together. This may be attributed to the decorrelation of LSFs within and across frames by the KLT. Because the dimension of the vectors is larger, the block quantiser can

Conclusion

In this paper, we have investigated the multi-frame GMM-based block quantiser for the coding of line spectral frequencies. By concatenating multiple frames together, correlation between LSFs within each frame and across successive frames can be exploited by the KLT, leading to better coding. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion,

References (28)

  • N. Sugamura et al., Speech analysis and synthesis methods developed at ECL in NTT – from LPC to LSP, Speech Commun. (1986)
  • B.S. Atal et al., Predictive coding of speech signals and subjective error criteria, IEEE Trans. Acoust., Speech, Signal Process. (1979)
  • Campbell, Jr., J.P., Welch, V.C., Tremain, T.E., 1989. An expandable error-protected 4800 bps CELP Coder (U.S. Federal...
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. (1977)
  • W.R. Gardner et al., Theoretical analysis of the high-rate vector quantization of LPC parameters, IEEE Trans. Speech Audio Process. (1995)
  • A. Gersho et al., Vector Quantization and Signal Compression (1992)
  • A. Gray et al., Quantization and bit allocation in speech processing, IEEE Trans. Acoust., Speech, Signal Process. (1976)
  • P. Hedelin et al., Vector quantization based on Gaussian mixture models, IEEE Trans. Speech Audio Process. (2000)
  • J.J.Y. Huang et al., Block quantization of correlated Gaussian random variables, IEEE Trans. Commun. Syst. (1963)
  • F. Itakura, Line spectrum representation of linear predictive coefficients of speech signals, J. Acoust. Soc. Am. (1975)
  • F. Itakura et al., Speech analysis-synthesis based on the partial autocorrelation coefficient, Proc. JSA (1969)
  • P. Kroon et al., Linear-prediction based analysis-by-synthesis coding
  • W.P. LeBlanc et al., Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding, IEEE Trans. Speech Audio Process. (1993)
  • Y. Linde et al., An algorithm for vector quantizer design, IEEE Trans. Commun. (1980)