Signal Processing

Volume 87, Issue 11, November 2007, Pages 2607-2628

New signal decomposition method based speech enhancement

https://doi.org/10.1016/j.sigpro.2007.04.014

Abstract

The auditory system, like the visual system, may be sensitive to abrupt stimulus changes, and the transient component in speech may be particularly critical to speech perception. If this component can be identified and selectively amplified, improved speech perception in background noise may be possible. This paper describes an algorithm to decompose speech into tonal, transient, and residual components. The modified discrete cosine transform (MDCT) was used to capture the tonal component and the wavelet transform was used to capture transient features. A hidden Markov chain (HMC) model and a hidden Markov tree (HMT) model were applied to capture statistical dependencies between the MDCT coefficients and between the wavelet coefficients, respectively. The transient component identified by the wavelet transform was selectively amplified and recombined with the original speech to generate modified speech, with energy adjusted to equal the energy of the original speech. The intelligibility of the original and modified speech was evaluated in eleven human subjects using the modified rhyme protocol. Word recognition rate results show that the modified speech can improve speech intelligibility at low SNR levels (8% at -15dB, 14% at -20dB, and 18% at -25dB) and has minimal effect on intelligibility at higher SNR levels.

Introduction

The auditory system, like the visual system, may be sensitive to abrupt stimulus changes, and the transient component in speech may be particularly critical to speech perception. If this component can be identified and selectively amplified, improved speech perception in background noise may be possible. This suggests an approach to improving the intelligibility of speech in background noise that differs from previous speech enhancement approaches. Speech enhancement in past decades has emphasized minimizing the effects of noise [1], but our approach is to modify the speech itself, by the use of the transient component, to improve its intelligibility. The particular application we are considering is radio communication between a centrally located coordinator in a quiet environment with field operators in noisy environments, such as an operator at a command post communicating with a squad deployed in the field. Noise conditions in these environments can be very difficult, with signal-to-noise ratios (SNR) well below 0 dB. We assume for the proposed work that virtually noise-free original speech is available for processing before it is sent to the listener.

Hazan and Simpson investigated the effect of cue-enhancement on the intelligibility of vowel-consonant-vowel (VCV) nonsense words and sentence materials presented in noise [2]. They manually annotated regions with a high concentration of acoustic cues, which are inherently transient, together with the perceptually important formant transitions following plosive release. These cues were selectively enhanced to increase speech intelligibility. They used a manual approach to avoid the effects of errors that can occur in an automatic speech enhancement approach. Moreover, this approach allowed them to evaluate the effects of various types and degrees of enhancement. Experimental results at SNR levels of 0 and -5dB showed an improvement of about 10% in speech intelligibility in background noise. They concluded that speech enhancement based on knowledge of acoustic cues can improve speech perception in poor listening conditions [2].

Yoo developed an approach to identify a transient component using time-varying bandpass filters [3]. The original (unprocessed) speech was high-pass filtered at 700 Hz, and three time-varying bandpass filters were used to remove the dominant formants, leaving the transient component. The original speech was modified by amplifying the transient component and adding it to the original speech. Yoo found that the intelligibility of the modified speech was significantly greater (up to 30%) than that of the original speech at SNR of -25, -20, and -15dB. However, the transient obtained by Yoo appeared to retain a significant amount of formant energy during what would appear to be tonal regions of the speech. Furthermore, high-frequency emphasis or removal of the first formant can significantly improve speech intelligibility [4]. The purpose of this paper is to describe an alternative approach to defining speech transients and to determine whether it might be more effective in improving speech intelligibility without relying on an improvement due to high-frequency emphasis.

Daudet and Torrésani proposed an approach to achieve a low bit rate in coding a musical signal. Their approach used a transform coding to decompose the musical signal into tonal, transient, and residual components and separately coded the individual components to produce more efficient coding [5]. The original signal was transformed using the modified discrete cosine transform (MDCT), which provides good estimates of locally stationary signals [5]. The tonal component was estimated by the inverse transform of a small number of MDCT coefficients whose absolute values exceeded a selected threshold. The tonal component was subtracted from the original signal to obtain what they defined as the non-tonal component. The non-tonal component was transformed using the wavelet transform, which provides good results in encoding signals with abrupt temporal changes. The transient component was estimated by the inverse of the wavelet transform, using a small number of wavelet coefficients whose absolute values exceeded another selected threshold. The residual component, obtained by subtracting the transient component from the non-tonal component, was expected to be a stationary random process with a flat spectrum. Daudet and Torrésani's approach was effective in reducing the bit rate required to code the signal. However, this algorithm relied on empirical thresholds to determine the significant MDCT and wavelet coefficients, and the most appropriate thresholds are not known. We wanted to use their modeling approach without having to define empirical thresholds.
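The threshold-based decomposition described above can be sketched as follows. This is only an illustration under stated assumptions, not Daudet and Torrésani's implementation: an orthonormal DCT-IV over the whole block stands in for a framed MDCT, a full-depth Haar transform stands in for their wavelet transform, and the function names and the two threshold parameters are hypothetical.

```python
import numpy as np

def dct4_matrix(n):
    """Orthonormal DCT-IV matrix (assumed stand-in for the MDCT); symmetric and its own inverse."""
    idx = np.arange(n) + 0.5
    return np.sqrt(2.0 / n) * np.cos(np.pi / n * np.outer(idx, idx))

def haar_forward(x):
    """Full-depth orthonormal Haar transform; len(x) must be a power of two."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details  # details are ordered fine -> coarse

def haar_inverse(approx, details):
    """Invert haar_forward."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):  # rebuild from coarse to fine
        out = np.empty(2 * d.size)
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

def hard_threshold(c, thresh):
    """Keep coefficients whose absolute value exceeds thresh, zero the rest."""
    return np.where(np.abs(c) > thresh, c, 0.0)

def decompose(x, tonal_thresh, transient_thresh):
    """Split x into tonal, transient, and residual parts by hard thresholding."""
    x = np.asarray(x, dtype=float)
    C = dct4_matrix(x.size)
    # Tonal: inverse transform of the large cosine-transform coefficients.
    tonal = C @ hard_threshold(C @ x, tonal_thresh)
    non_tonal = x - tonal
    # Transient: inverse transform of the large wavelet coefficients.
    approx, details = haar_forward(non_tonal)
    transient = haar_inverse(
        hard_threshold(approx, transient_thresh),
        [hard_threshold(d, transient_thresh) for d in details],
    )
    residual = non_tonal - transient
    return tonal, transient, residual
```

By construction the three parts sum back to the input. The split between them depends entirely on the two thresholds, which is the limitation noted above.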

Another potential limitation of Daudet and Torrésani's approach is that the MDCT coefficients and the wavelet coefficients may show statistical dependencies. Crouse et al. have suggested that the wavelet coefficients have statistical dependencies described by clustering and persistence properties [6]. Daudet et al. have suggested that the same may be true for the MDCT coefficients [7]. These properties are described for both the MDCT and wavelet coefficients as follows. For the clustering property, if a particular MDCT/wavelet coefficient is large or small, then the adjacent MDCT/wavelet coefficients are likely to be large or small, respectively. For the persistence property, large or small values of MDCT/wavelet coefficients have a tendency to propagate across frequencies/scales.

Crouse et al. developed a probabilistic model to capture complex dependencies and non-Gaussian statistics of the wavelet transform. They used the model, called the hidden Markov tree (HMT) model, to describe the statistical dependencies of the wavelet coefficients along and across scale, based on clustering and persistence properties, by utilizing Markov dependencies [6]. They modeled the wavelet coefficients as a two-state, zero-mean Gaussian mixture, where “large” and “small” states were associated with large and small variance, zero-mean Gaussian distributions, respectively. The wavelet coefficients were observed but the state variables were hidden. Having introduced the upward–downward algorithm for training the model, they compared the denoising performance of wavelet HMT to state-of-the-art wavelet denoising methods (SureShrink [8], Bayesian [9], and Independent Mixture [6]) on various noisy signals including bumps, blocks, doppler, and heavisine [6]. The denoised signals using HMT showed significant improvements by having smaller mean-squared errors compared to other methods.
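As a small illustration of the two-state, zero-mean Gaussian mixture, the sketch below computes the posterior probability that a single coefficient came from the "large" state. It treats each coefficient independently, so it deliberately ignores the Markov dependencies that the HMT adds. The function names and the parameters `p_large`, `sigma_small`, and `sigma_large` are hypothetical and are not taken from Crouse et al.

```python
import numpy as np

def gaussian_pdf(w, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-0.5 * (w / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def large_state_posterior(w, p_large, sigma_small, sigma_large):
    """P(state = large | w) for a two-state zero-mean Gaussian mixture (Bayes' rule)."""
    w = np.asarray(w, dtype=float)
    num = p_large * gaussian_pdf(w, sigma_large)
    den = num + (1.0 - p_large) * gaussian_pdf(w, sigma_small)
    return num / den

# Coefficients near zero are attributed to the small-variance state,
# coefficients far from zero to the large-variance state.
posterior = large_state_posterior([0.0, 0.1, 3.0], p_large=0.5,
                                  sigma_small=0.2, sigma_large=2.0)
```

For very large coefficients both densities can underflow to zero. A log-domain version would be more robust but is omitted here for brevity.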

Molla and Torrésani proposed that capturing statistical dependencies of the wavelet coefficients would improve the identification of the transient component in a musical signal [10]. The HMT model was used to capture the dependencies [10]. They associated the transient state with a large-variance Gaussian distribution and the residual state with a small-variance Gaussian distribution. They used the statistical inference method [11], which is more robust to the numerical underflow problem than is the upward–downward algorithm.

Daudet et al. proposed a probabilistic model to estimate the tonal component in a musical signal [7]. They applied a hidden Markov chain (HMC) model [12] to describe the statistical dependencies of the MDCT coefficients in each frequency index. They modeled the MDCT coefficients as a two-state, zero-mean Gaussian mixture. A tonal state was associated with a large-variance Gaussian distribution, and a non-tonal state was associated with a small-variance Gaussian distribution.
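A two-state hidden Markov chain over the coefficients of one frequency index can be sketched with the standard scaled forward-backward recursion [12]. This is an assumed, minimal formulation with known parameters, not the training procedure of Daudet et al.; the state ordering (0 = non-tonal, 1 = tonal) and all parameter values in the usage lines are hypothetical.

```python
import numpy as np

def forward_backward(obs, trans, init, sigmas):
    """Posterior state probabilities for a chain with zero-mean Gaussian emissions.

    obs    : coefficients along time for one frequency index, shape (T,)
    trans  : trans[i, j] = P(state j at t+1 | state i at t), shape (S, S)
    init   : initial state probabilities, shape (S,)
    sigmas : emission standard deviation of each state, shape (S,)
    """
    obs = np.asarray(obs, dtype=float)
    trans = np.asarray(trans, dtype=float)
    init = np.asarray(init, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    T, S = obs.size, sigmas.size
    # Emission likelihoods b[t, s] = N(obs[t]; 0, sigmas[s]^2).
    b = np.exp(-0.5 * (obs[:, None] / sigmas) ** 2) / (sigmas * np.sqrt(2.0 * np.pi))
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    scale = np.zeros(T)
    # Forward pass, rescaled at every step to avoid numerical underflow.
    alpha[0] = init * b[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * b[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # Backward pass, reusing the same scale factors.
    for t in range(T - 2, -1, -1):
        beta[t] = (trans @ (b[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical usage: a burst of large coefficients between two quiet stretches.
obs = np.concatenate([0.05 * np.ones(10), 1.5 * np.ones(10), 0.05 * np.ones(10)])
gamma = forward_backward(obs, [[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5], [0.1, 1.0])
tonal_mask = gamma[:, 1] > 0.5  # frames assigned to the large-variance (tonal) state
```

Thresholding the posterior at 0.5 replaces an empirical magnitude threshold with a decision that also depends on neighboring coefficients through the transition matrix.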

These researchers [5], [6], [7], [10] did not specifically apply their approaches to speech. Our objective was to develop an algorithm based on these models to isolate the transient component in speech and evaluate the use of this component to improve speech intelligibility in background noise. An early version of the algorithm has been described [13], where the MDCT coefficients and the wavelet coefficients were assumed to be a mixture of two univariate Gaussian distributions, following the previous researchers [5], [6], [7], [10] for the sake of simplicity. The present paper describes the final form of our algorithm in detail and expands on the previous presentation of results, including direct comparisons with algorithms of other investigators, speech decomposition of noisy speech, the effect of a traditional noise suppressor (spectral subtraction) on the benefits of the modified speech corrupted by a high level of background noise, and analysis of phoneme confusions in word recognition.

Details of the algorithm are described in Section 2. Examples of speech decomposition results in clean and noisy conditions are described in Section 3. In Section 4, the transient components identified by our algorithm, an implementation of Daudet and Torrésani's algorithm, and results obtained by Yoo are compared. The intelligibility of the original speech and speech modified by adding an amplified transient component were evaluated by the modified rhyme test, as described in Section 5, and the experimental results are presented. These test results have been briefly described previously [14]. This paper adds an analysis of confusions across phonetic categories at -25dB. To investigate whether emphasis in high-frequency regions improves speech intelligibility in background noise, the transient component was high-pass filtered and then was used to generate another version of the modified speech as described in Section 6. To investigate whether conventional noise reduction techniques affect the benefits of the modified speech especially in transients, spectral subtraction was chosen as a traditional noise suppressor to enhance the modified speech corrupted by a high level of background noise as described in Section 7. The implications of these results and limitations in extending them to practical situations are discussed in Section 8.

Section snippets

The modified discrete cosine transform (MDCT)

The MDCT was introduced by Princen and Bradley [15] based on the concept of time domain aliasing cancelation (TDAC). It is a Fourier-related transform, based on the type-IV discrete cosine transform (DCT-IV) [16]. It is also referred to as the perfect reconstruction cosine modulated filter bank with some restrictions on the window w(n) [17].
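The time domain aliasing cancellation property can be checked numerically with a short sketch. It assumes a sine window, which is one common window satisfying the Princen-Bradley condition, a hop of M samples, and blocks of 2M samples; the function names are hypothetical and the direct matrix form is chosen for clarity rather than speed.

```python
import numpy as np

def sine_window(M):
    """Sine window of length 2M; w[n]^2 + w[n + M]^2 = 1 (Princen-Bradley condition)."""
    n = np.arange(2 * M)
    return np.sin(np.pi * (n + 0.5) / (2 * M))

def mdct_basis(M):
    """M x 2M cosine basis of the MDCT."""
    n = np.arange(2 * M)
    k = np.arange(M)
    return np.cos(np.pi / M * np.outer(k + 0.5, n + 0.5 + M / 2))

def mdct(x, M):
    """Windowed MDCT with hop M; returns an array of shape (n_frames, M)."""
    x = np.asarray(x, dtype=float)
    w, A = sine_window(M), mdct_basis(M)
    n_frames = x.size // M - 1  # trailing samples beyond a multiple of M are ignored
    frames = np.stack([x[i * M : i * M + 2 * M] for i in range(n_frames)])
    return (frames * w) @ A.T

def imdct(X, M):
    """Windowed inverse MDCT with overlap-add of consecutive 2M-sample blocks."""
    w, A = sine_window(M), mdct_basis(M)
    n_frames = X.shape[0]
    y = np.zeros((n_frames + 1) * M)
    for i in range(n_frames):
        # Each block carries time-domain aliasing; it cancels in the overlap-add.
        y[i * M : i * M + 2 * M] += (2.0 / M) * w * (X[i] @ A)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
y = imdct(mdct(x, 16), 16)
# Samples covered by two overlapping blocks are reconstructed exactly;
# the first and last M samples are covered by only one block and are not.
```

Each block of 2M input samples produces only M coefficients, so the transform is critically sampled, and reconstruction relies on the aliasing of adjacent blocks cancelling.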

In the forward MDCT, the input signal, x(n), is divided into frames (each frame with the length of M samples). Then, a block transform of length 2M samples (

Speech decomposition results

Fifty monosyllabic CVC words from NU-6 word list [20] and 300 rhyming words from House et al. [21] were decomposed using the algorithm discussed above. The tonal component predominantly included constant frequency information of vowel formants and consonant hubs. The tonal component included most of the energy of the original speech (96.80% of the total speech energy), but this component was difficult to recognize as the original speech. The transient component included comparatively little

Comparisons of transient components identified by various algorithms

If our method captures statistical dependencies between the MDCT coefficients and between the wavelet coefficients, it should provide more effective identification of the transient components compared to an algorithm that ignores these dependencies. To investigate this suggestion, the transient components identified by our algorithm, an implementation of Daudet and Torrésani's algorithm, and Yoo's algorithm were compared. The transient components from our algorithm were decomposed by the

Modifying speech to improve intelligibility

To investigate the possibility that the transient component of speech can be used to improve speech recognition in background noise, the transient component identified by our algorithm was selectively amplified and recombined with the original speech. The energy of modified speech was adjusted to be equal to the energy of the original speech, and the intelligibility of the original and modified speech was evaluated in 11 subjects using the modified rhyme protocol—a method to measure speech
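The recombination and energy adjustment described here amount to a few lines. The sketch below is a minimal illustration; the function name and the `gain` parameter are hypothetical, and the amplification factor actually used is not stated in this excerpt.

```python
import numpy as np

def modify_speech(original, transient, gain):
    """Add an amplified transient component, then rescale to the original energy."""
    original = np.asarray(original, dtype=float)
    transient = np.asarray(transient, dtype=float)
    mixed = original + gain * transient
    # Scale so that the sum of squares of the output equals that of the original.
    scale = np.sqrt(np.sum(original ** 2) / np.sum(mixed ** 2))
    return scale * mixed
```

Because the output energy is pinned to that of the original, any change in intelligibility at a given SNR cannot be attributed to the modified speech simply being louder.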

Modified speech emphasized in high frequency regions

The objective of this evaluation is to examine how high-frequency emphasis affects the intelligibility of speech in background noise. The motivation of this study is that the intelligibility of the modified speech generated by the algorithm of Yoo [3] is better than that of the modified speech generated by our method. These are based on the results of two psychoacoustic experiments performed at the Department of Communication Science and Disorders, University of Pittsburgh using the same test

The effect of conventional noise reduction on the benefits of the modified speech

This study is motivated by the well-known observation that conventional noise reduction methods, such as spectral subtraction, adaptive filtering, adaptive noise cancelation, and harmonic selection, face problems especially at transients. This experiment investigates whether the benefits of the modified speech are still obtained if a conventional noise reduction method is applied. Specifically, spectral subtraction applied to the modified speech corrupted by a

Discussion

We have presented a method to identify transient information in speech using MDCT-based hidden Markov chain (HMC) and wavelet-based hidden Markov tree (HMT) models. Our algorithm, a modification of Daudet and Torrésani [5], avoids thresholds and describes the clustering and persistence statistical dependencies between the MDCT coefficients and between the wavelet coefficients. Although there is no quantitative definition of the transient component of speech, we expect the transient component to

References (31)

  • L. Daudet et al., Hybrid representation for audiophonic signal encoding, Signal Processing (2002)
  • J.S. Lim et al., Enhancement and bandwidth compression of noisy speech, Proc. IEEE (1979)
  • V. Hazan et al., The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise, Speech Commun. (1998)
  • S. Yoo, Speech decomposition and speech enhancement, Ph.D. thesis, Department of Electrical and Computer Engineering,...
  • R.J. Niederjohn et al., The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression, IEEE Trans. Acoust. Speech Signal Process. (1976)
  • M.S. Crouse et al., Wavelet-based statistical signal processing using hidden Markov models, IEEE Trans. Signal Process. (1998)
  • L. Daudet et al., Towards a hybrid audio coder
  • D. Donoho et al., Adapting to unknown smoothness via wavelet shrinkage, J. Amer. Statist. Assoc. (1995)
  • H. Chipman et al., Adaptive Bayesian wavelet shrinkage, J. Amer. Statist. Assoc. (1997)
  • S. Molla, B. Torrésani, Hidden Markov tree of wavelet coefficients for transient detection in audiophonic signals, in:...
  • J.B. Durand, P. Gonçalvès, Statistical inference for hidden Markov tree models and application to wavelet trees,...
  • L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE (1989)
  • C. Tantibundhit et al., Automatic speech decomposition and speech coding using MDCT-based hidden Markov chain and wavelet-based hidden Markov tree models
  • C. Tantibundhit et al., Speech enhancement using transient speech components
  • J.P. Princen et al., Analysis/synthesis filter bank design based on time domain aliasing cancellation, IEEE Trans. Acoust. Speech Signal Process. (1986)

This work is supported by the Office of Naval Research under the grant number N000140310277.