
Speech Communication

Volume 47, Issue 3, November 2005, Pages 265-276

Multi-frame GMM-based block quantisation of line spectral frequencies

https://doi.org/10.1016/j.specom.2005.02.007

Abstract

In this paper, we investigate the use of the Gaussian mixture model (GMM)-based block quantiser for coding line spectral frequencies, using multiple frames and mean squared error as the quantiser selection criterion. As a viable alternative to vector quantisers, the GMM-based block quantiser offers low computational and memory requirements as well as bitrate scalability. Jointly quantising multiple frames exploits the correlation across successive frames, which leads to more efficient block quantisation. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion. These distortion performance gains come at the cost of increased computational complexity and memory. Experiments on narrowband speech from the TIMIT database demonstrate that the multi-frame GMM-based block quantiser can achieve a spectral distortion of 1 dB at 22 bits/frame, or 21 bits/frame with some added complexity.

Introduction

Linear predictive coding (LPC) of speech requires the accurate quantisation of parameters representing the spectral envelope. Speech is windowed into frames and the spectral envelope is parametrically modelled as an all-pole filter, whose coefficients are called LPC parameters. These LPC parameters are generally quantised in terms of line spectral frequencies (LSFs) using a vector quantiser (VQ). Extrapolating from the operating curve of full search VQ suggests that about 19 bits/frame are needed to achieve transparent coding of these parameters (Paliwal and Kleijn, 1995), while high-rate analysis predicts a lower bound of 23 bits/frame (Hedelin and Skoglund, 2000). Designing codebooks at these rates is not feasible and, in addition, the computational cost of the resulting full search vector quantiser is very high.

Less complex but suboptimal vector quantisers such as multistage and split VQ have been investigated in the speech coding literature (LeBlanc et al., 1993, Paliwal and Atal, 1993), where it was generally observed that 22–24 bits/frame were required to achieve transparent coding of speech, with varying degrees of complexity. Further gains in performance can be achieved by exploiting temporal correlation between successive frames. Matrix quantisation (Tsao and Gray, 1985) and its derivatives such as split matrix quantisation (Xydeas and Papanastasiou, 1999) and multi-mode matrix quantisation (Nurminen et al., 2003, Sinervo et al., 2003) perform better by jointly quantising LSF frames.

The use of Gaussian mixture models (GMM) for the coding of LSFs has been investigated in (Hedelin and Skoglund, 2000, Shabestary and Hedelin, 2002, Subramaniam and Rao, 2000, Subramaniam and Rao, 2001, Subramaniam and Rao, 2003). In (Subramaniam and Rao, 2003), a GMM is used to parameterise the probability density function (PDF) of the source and optimised Gaussian block quantisers are designed for each cluster (or mixture component). Using this quantiser in its fixed rate mode, a spectral distortion of approximately 1 dB was achieved at 24 bits/frame. The main advantages of this scheme over vector quantisers include (Subramaniam and Rao, 2003):

  1. lower complexity through the use of block quantisers;
  2. bitrate scalability; and
  3. search complexity and memory requirements that are independent of the rate of the system.
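In outline, a fixed-rate GMM-based block quantiser encodes each input vector with every cluster's quantiser and keeps the one that reproduces it best. The sketch below is purely illustrative (the cluster parameter layout, uniform scalar quantiser, and step sizes are assumptions, not the authors' exact design); it uses minimum MSE as the selection rule, which is the criterion this paper argues for:

```python
import numpy as np

def encode(x, clusters, levels=16):
    """Pick the cluster quantiser that reproduces x best (minimal sketch).

    clusters: list of (mean, eigvecs, step) per Gaussian mixture component,
    assumed precomputed offline; 'step' is a scalar quantiser step size.
    All names and structures here are illustrative.
    """
    best = None
    for m, (mean, V, step) in enumerate(clusters):
        z = V.T @ (x - mean)                        # rotate into KLT domain
        idx = np.clip(np.round(z / step), -levels // 2, levels // 2 - 1)
        x_hat = V @ (idx * step) + mean             # local reconstruction
        err = np.sum((x - x_hat) ** 2)              # MSE selection criterion
        if best is None or err < best[0]:
            best = (err, m, idx)
    return best[1], best[2]                         # winning cluster, indices
```

The decoder only needs the transmitted cluster index and quantiser indices to invert the same per-cluster transform, which is why search cost scales with the number of clusters rather than the bitrate.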

A modified quantiser with memory was also described in (Subramaniam and Rao, 2003) that coded the difference between successive frames, similar to differential pulse code modulation (DPCM) with a one-tap predictor. A spectral distortion of 1 dB was achieved at 22 bits/frame (Subramaniam and Rao, 2003). During the coding process, the spectral distortion (SD) calculation is used frequently for cluster quantiser selection. While approximate high-rate expressions exist for the spectral distortion (Gardner and Rao, 1995), its computational cost is still considerably higher than that of the mean squared error (MSE).
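The cost gap between the two selection criteria is easy to see in code: SD needs an FFT per candidate LPC model, while MSE is a single vector difference. A minimal numpy sketch of both measures (for simplicity it evaluates the full-band LPC envelope on an FFT grid; practical SD figures are usually computed over a limited band such as 0–3 kHz):

```python
import numpy as np

def spectral_distortion(a, a_hat, n_fft=512):
    """Spectral distortion (dB) between two LPC envelopes 1/|A(f)|^2.

    a, a_hat: LPC coefficient vectors [1, a1, ..., an].
    Requires an FFT per model, hence the higher cost versus MSE.
    """
    P = 1.0 / np.abs(np.fft.rfft(a, n_fft)) ** 2        # envelope of model a
    P_hat = 1.0 / np.abs(np.fft.rfft(a_hat, n_fft)) ** 2
    diff_db = 10.0 * np.log10(P / P_hat)                # log-spectral difference
    return np.sqrt(np.mean(diff_db ** 2))               # RMS over the grid

def mse(x, x_hat):
    """Mean squared error between two LSF vectors: no transform needed."""
    return np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
```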

In this paper, we investigate a modified version of the fixed-rate GMM-based block quantiser that operates on multiple frames and uses the mean squared error (MSE) distortion criterion. We have found this system to perform better, in terms of spectral distortion, than both the single-frame and the predictive quantisers of (Subramaniam and Rao, 2003).

The organisation of this paper is as follows. Section 2 introduces some preliminaries such as the line spectral frequency representation of LPC parameters and distortion measures that are commonly used in speech coding. In Section 3, we describe the operation of the multi-frame GMM-based block quantiser as well as its computational and memory requirements. Section 4 details the LPC analysis method and speech database that we have used to evaluate the performance of the quantiser. Section 5 discusses the performance of the multi-frame GMM-based block quantiser and how it compares with other quantisation schemes. Finally, we conclude in Section 6.

Section snippets

LSF representation of LPC coefficients

In the LPC analysis of speech, a short segment of speech is assumed to be the output of an all-pole filter, H(z) = 1/A(z), driven by white Gaussian noise, where A(z) is the inverse filter given by (Paliwal and Atal, 1993):

A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_n z^{-n}

Here, n is the order of LPC analysis and {a_i}_{i=1}^{n} are the LPC coefficients. Because H(z) is used to reconstruct speech in linear predictive speech coders, its stability is of utmost importance and cannot be ensured when LPC coefficients are coded
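The standard route to LSFs, and to the stability guarantee they provide, is to split A(z) into a symmetric polynomial P(z) = A(z) + z^{-(n+1)}A(z^{-1}) and an antisymmetric polynomial Q(z) = A(z) - z^{-(n+1)}A(z^{-1}); for a stable H(z) their roots lie on the unit circle and interlace. A minimal numpy sketch (root-finding via np.roots is the simplest approach, not the most numerically robust one used in real coders):

```python
import numpy as np

def lpc_to_lsf(a):
    """LSFs (radians, ascending) from LPC coefficients a = [1, a1, ..., an]."""
    a = np.asarray(a, dtype=float)
    # P(z) and Q(z): add/subtract the time-reversed, delayed coefficient vector
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        # keep one angle per conjugate pair, dropping the trivial roots at z = +/-1
        # (the tolerance is a crude guard; robust coders track roots differently)
        angles.extend(w[(w > 1e-4) & (w < np.pi - 1e-4)])
    return np.sort(np.array(angles))
```

Checking that a quantised LSF vector is strictly ascending (and re-ordering or spacing it if not) is all the decoder needs to guarantee a stable reconstructed H(z), which is the property that makes LSFs attractive for quantisation.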

Multi-frame GMM-based block quantisation

The multi-frame GMM-based block quantiser is based on the memoryless version proposed by Subramaniam and Rao (2003) for the coding of speech line spectral frequencies (LSF), where a Gaussian mixture model (GMM) is used to parametrically model the probability density function (PDF) of the source and block quantisers are then designed for each Gaussian mixture component (or, cluster). This modified scheme exploits interframe correlation by concatenating p successive frames into a larger vector.
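The two preprocessing steps, concatenating p frames into supervectors and decorrelating them with a Karhunen-Loeve transform, can be sketched as follows. This is a simplified single-transform view under assumed shapes (the actual scheme estimates a separate mean, covariance, and KLT per mixture component):

```python
import numpy as np

def stack_frames(lsf, p):
    """Concatenate p successive n-dim LSF frames into (n*p)-dim supervectors."""
    n_frames = (len(lsf) // p) * p          # drop the ragged tail
    return lsf[:n_frames].reshape(-1, p * lsf.shape[1])

def klt(vectors):
    """Karhunen-Loeve transform: rotate onto the covariance eigenvectors.

    Returns the decorrelated coefficients and the per-component variances
    (eigenvalues, descending), which drive the bit allocation.
    """
    mean = vectors.mean(axis=0)
    cov = np.cov(vectors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder by descending variance
    return (vectors - mean) @ eigvecs[:, order], eigvals[order]
```

Because adjacent frames are correlated, the eigenvalue spectrum of the stacked vectors decays faster than that of single frames, so more bits can be concentrated on the few high-variance components, which is the source of the multi-frame coding gain described above.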

Experimental setup

The TIMIT database was used to train and test the various quantisation schemes. It consists of speech down-sampled to 8 kHz with a 3.4 kHz anti-aliasing filter applied. A 20 ms Hamming window is used and a tenth order linear predictive analysis is performed on each frame using the autocorrelation method (Paliwal and Kleijn, 1995). There is no overlap between successive speech frames. High frequency compensation and bandwidth expansion of 15 Hz
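Under the stated setup (20 ms Hamming window, tenth-order autocorrelation-method LPC, 15 Hz bandwidth expansion), the analysis step might look like the following sketch; it uses the Levinson-Durbin recursion, and parameter names are illustrative:

```python
import numpy as np

def lpc_autocorr(frame, order=10, bw_hz=15.0, fs=8000.0):
    """LPC via the autocorrelation method with Levinson-Durbin recursion."""
    x = frame * np.hamming(len(frame))                     # taper the frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = np.array([1.0])                                    # A(z) so far
    err = r[0]                                             # prediction error
    for i in range(1, order + 1):
        k = -np.dot(a, r[i:0:-1]) / err                    # reflection coeff
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
        err *= (1.0 - k * k)
    # bandwidth expansion: scale a_i by gamma^i, pulling poles off the circle
    gamma = np.exp(-np.pi * bw_hz / fs)
    return a * gamma ** np.arange(order + 1)
```

The bandwidth-expansion factor gamma = exp(-pi * 15 / 8000) ≈ 0.994 widens each formant bandwidth by roughly 15 Hz, a common safeguard against sharp spectral peaks before quantisation.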

Spectral distortion performance of the 16 cluster multi-frame GMM-based block quantiser

Table 3 shows the spectral distortion performance of the 16 cluster, multi-frame GMM-based block quantiser for varying bitrates and number of concatenated frames, p. A spectral distortion of 1 dB has been achieved at 22 bits/frame with p = 3. For any given bitrate, the spectral distortion decreases as more frames are concatenated together. This may be attributed to the decorrelation of LSFs within and across frames by the KLT. Because the dimension of the vectors is larger, the block quantiser can

Conclusion

In this paper, we have investigated the multi-frame GMM-based block quantiser for the coding of line spectral frequencies. By concatenating multiple frames together, correlation between LSFs within each frame and across successive frames can be exploited by the KLT, leading to better coding. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion,

References (28)

  • N. Sugamura et al., Speech analysis and synthesis methods developed at ECL in NTT – from LPC to LSP, Speech Commun. (1986)
  • B.S. Atal et al., Predictive coding of speech signals and subjective error criteria, IEEE Trans. Acoust., Speech, Signal Process. (1979)
  • Campbell, Jr., J.P., Welch, V.C., Tremain, T.E., 1989. An expandable error-protected 4800 bps CELP Coder (U.S. Federal...
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. (1977)
  • W.R. Gardner et al., Theoretical analysis of the high-rate vector quantization of LPC parameters, IEEE Trans. Speech Audio Process. (1995)
  • A. Gersho et al., Vector Quantization and Signal Compression (1992)
  • A. Gray et al., Quantization and bit allocation in speech processing, IEEE Trans. Acoust., Speech, Signal Process. (1976)
  • P. Hedelin et al., Vector quantization based on Gaussian mixture models, IEEE Trans. Speech Audio Process. (2000)
  • J.J.Y. Huang et al., Block quantization of correlated Gaussian random variables, IEEE Trans. Commun. Syst. (1963)
  • F. Itakura, Line spectrum representation of linear predictive coefficients of speech signals, J. Acoust. Soc. Am. (1975)
  • F. Itakura et al., Speech analysis-synthesis based on the partial autocorrelation coefficient, Proc. JSA (1969)
  • P. Kroon et al., Linear-prediction based analysis-by-synthesis coding
  • W.P. LeBlanc et al., Efficient search and design procedures for robust multi-stage VQ of LPC parameters for 4 kb/s speech coding, IEEE Trans. Speech Audio Process. (1993)
  • Y. Linde et al., An algorithm for vector quantizer design, IEEE Trans. Commun. (1980)