Elsevier

Signal Processing

Volume 106, January 2015, Pages 266-281

Joint source separation and dereverberation using constrained spectral divergence optimization

https://doi.org/10.1016/j.sigpro.2014.08.009

Highlights

  • A novel method for joint source separation and dereverberation in an NMF framework is proposed.

  • The method uses constrained spectral divergence minimization by imposing non-negative constraints on subband envelopes.

  • The group delay spectrum is utilized for source separation and dereverberation.

  • Accurate NMF decompositions are obtained due to the robustness and high spectral resolution of the group delay spectrum.

Abstract

A novel method for joint source separation and dereverberation that minimizes the divergence between the observed and true spectral subband envelopes is discussed in this paper. This divergence minimization is carried out within the non-negative matrix factorization (NMF) framework by imposing non-negative constraints on the subband envelopes. Additionally, the joint source separation and dereverberation framework described herein utilizes the spectral subband envelope obtained from the group delay spectral magnitude (GDSM). In order to obtain the spectral subband envelope from the GDSM, the equivalence of the magnitude and the group delay spectrum via the weighted cepstrum is used. Since the subband envelope of the group delay spectral magnitude is robust and has high spectral resolution, less error is noted in the NMF decomposition. Late reverberation components present in the separated signals are then removed using a modified spectral subtraction technique. The quality of the separated and dereverberated speech signals is evaluated using several objective and subjective criteria. Experiments on distant speech recognition are then conducted at various direct-to-reverberant ratios (DRRs) on the GRID corpus. Experimental results indicate significant improvements over existing methods in the literature.

Introduction

The objective of any source separation method is to recover the original signals from a composite signal. The separation problem becomes more difficult when the signals are mixed in a reverberant environment. Reverberation occurs when the distance between the speaker and the microphone is large enough for the speech signal to reach the microphone over multiple paths. Reverberation degrades both the intelligibility of the speech signal and speech recognition performance.

Several algorithms have been developed for single-channel speech dereverberation. A temporal averaging method is proposed in Xizhong and Guang [55] to estimate the room acoustic impulse response (AIR) via the complex cepstrum, using an adaptive segmentation technique; the inverse filter solution is then obtained from the pre-estimated AIR. In Furuya et al. [18], the blind estimation of the inverse filters required to obtain the dereverberated signal is explained. The inverse filters in Furuya et al. [18] are estimated by computing the correlation matrix between input signals, instead of the room impulse response. A two-stage algorithm for a single microphone has been proposed in Wu and Wang [54], where an inverse filter is estimated to reduce coloration effects in the first stage. Spectral subtraction is then applied as a post-processing step to minimize the influence of long-term reverberation. In Tomar [53], maximization of the kurtosis of the speech residual is proposed for blind dereverberation of the speech signal. A non-negative matrix factorization (NMF) method that performs dereverberation in the gammatone subband magnitude domain is proposed in Kumar et al. [31], where the Fourier transform spectral magnitude is used in an NMF framework for automatic speech recognition (ASR) applications. In Bees et al. [5], dereverberation is carried out by using the cepstrum to determine the acoustic impulse response, which is then used for inverse filtering to obtain an estimate of the clean speech. The truncation error present in Bees et al. [5] is removed in Xizhong and Guang [55], although inverse filtering is still required. The authors in Kameoka et al. [29] present a blind dereverberation method designed to recover the subband envelope of the original speech signal from its reverberant version; the problem is formulated as a blind deconvolution with non-negative constraints, regularized by the sparse nature of speech spectrograms. In Nakatani et al. [39], a harmonicity-based dereverberation method is discussed that reduces the amount of reverberation in the signal picked up by a single microphone. A variant of spectral subtraction described in Kinoshita et al. [30] utilizes multi-step forward linear prediction for speech dereverberation; it precisely estimates and suppresses the late reverberations, which enhances ASR performance. All these methods address the speech dereverberation problem in a single-source environment.

Considerable work has also been done to address the source separation problem in an anechoic environment. In the instantaneous frequency method [23], the objective is to extract the target component of speech mixed with interfering speech and to improve the recognition accuracy obtained using the recovered speech signal. The instantaneous frequency is used to reveal the underlying harmonic structures of a complex auditory scene. In latent variable decomposition [45], each magnitude spectral vector of the speech signal is represented as the outcome of a discrete random process. The latent Dirichlet decomposition method [44] is a generalization of latent variable decomposition that models the distribution process as a mixture of multinomial distributions; in this model, the mixture weights of the component multinomials vary from one analysis window to the next. Non-negative matrix factorization [48], [33], [49] is also an effective method in the context of mixed speaker separation, decomposing the STFT magnitude matrix [21]. A convolutive version of NMF that takes temporal variations into account for source separation is described in Smaragdis [52]. The single-channel separation of speech and music is discussed in Litvin et al. [34], utilizing the discrete energy separation algorithm (DESA). Apart from the single-channel case, multi-channel underdetermined blind source separation in an anechoic environment is discussed in Bofill and Zibulevsky [8] and Niknazar et al. [41]. In Bertrand and Moonen [7], a non-negative BSS method for a noise-free environment using multiplicative updates and subspace projection is presented.

In general, source separation and dereverberation are treated as separate problems, and solutions have been proposed for each of them individually, as can be noted from the aforementioned discussion. However, efforts have also been made to address the joint source separation and dereverberation problem. A joint optimization method for multi-channel blind source separation (BSS) and dereverberation is discussed in Yoshioka et al. [60], which optimizes the parameters of the prediction matrices and the separation matrices. A BSS framework for noisy and reverberant environments based on a matrix formulation is proposed in Aichner et al. [1]; the method allows simultaneous exploitation of the nonwhiteness and nonstationarity of the source signals using second-order statistics. In Xu et al. [56], the joint block Toeplitzation and block-inner diagonalization (JBTBID) of a set of correlation matrices of the observed vector sequence is obtained for convolutive BSS. In Yoshioka et al. [59], the conditional separation and dereverberation (CSD) method for simultaneously achieving blind source separation and dereverberation of sound mixtures is discussed. A tractable BSS framework is explained in Arberet et al. [4] for estimating and combining spectral source models from noisy source estimates. In Rotili et al. [47], a general broadband approach to BSS for convolutive mixtures based on second-order statistics is discussed; the optimum inverse filtering algorithm based on Bezout's theorem is used in the dereverberation stage, which is computationally more efficient and allows the inversion of long impulse responses in real-time applications. An integrated method for joint multi-channel blind dereverberation and separation of convolutive audio mixtures is discussed in Yoshioka et al. [58]. All the above methods follow a tandem approach to solve the separation and reverberation problems in the multi-channel scenario, and they all require multi-channel input. This assumption is relaxed in this work by considering the single-channel case.

The contributions of the paper are as follows. The paper proposes a new model for joint blind source separation and dereverberation for a single channel in a multisource environment. In this work, a different impulse response is considered for each speaker location. Additionally, the proposed method uses the subband envelope of the mixed speaker signal computed from the group delay spectral magnitude (GDSM) [57], [38] within the NMF framework. Due to the high resolution property of the group delay function [57], [24], [3], [10], [9], this method reduces the error in the decomposition of the observed subband envelope (OSE) sequence of the mixed signal into its constituent convolutional components.

In this work, the spectral divergence between the observed subband envelope and the true subband envelope (TSE) is minimized within the NMF framework. The convolutional components satisfying the non-negative constraint are updated in an iterative manner. Once the subband envelope updates are obtained for each speaker, the spectral magnitude is obtained by taking the square root of the corresponding subband envelopes. Because the NMF processing runs for a fixed number of iterations, some amount of late reverberation and residual noise is still present in the separated spectral magnitude updates; the remaining late reverberation and noise components are therefore removed by post-processing methods. Experiments on source separation and speech dereverberation are conducted on the GRID corpus [13]. The performance of the proposed method indicates reasonable improvements over other conventional methods in the literature. Additionally, experiments on distant speech recognition are conducted at various distances between the microphone and the speaker to evaluate the effect of distance on the performance of the speech recognition system.

The rest of the paper is organized as follows. Section 2 describes the model for source separation under a reverberant environment. In Section 3, the formulation of the source separation problem using constrained spectral divergence is discussed. The significance of the group delay spectral magnitude in the proposed framework is explained in Section 4. The algorithm for joint source separation and dereverberation is presented in Section 5. The performance evaluation of the proposed method is discussed in Section 6. Section 7 presents a brief conclusion.

Section snippets

Model for source separation under a reverberant environment

The system model for source separation under a reverberant environment is formulated herein. Fig. 1 illustrates the model for reverberation of two sources mixed at a single microphone in the presence of noise. Let the subband envelopes of the two speaker signals be denoted by L(m,k) and F(m,k). Here, m is the frame index and k ∈ {1, …, K} corresponds to the frequency bin index, where K is the total number of subbands in each frame. The subband envelope of the room impulse response (RIR) associated with the two speaker signals is
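The snippet breaks off at this point. For orientation, a plausible form of the convolutive mixing model it introduces is sketched below; this is a hedged reconstruction from the surrounding text and Section 3, in which H1(m,k) and H2(m,k) denote the RIR subband envelopes of the two speakers and N(m,k) a noise term, and the authoritative formulation is in the full text:

    Y(m,k) \;\approx\; \sum_{\tau} L(m-\tau,\,k)\, H_1(\tau,\,k) \;+\; \sum_{\tau} F(m-\tau,\,k)\, H_2(\tau,\,k) \;+\; N(m,k)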

Formulation of source separation problem using constrained spectral divergence optimization

In this section, the subband-envelope-domain model for joint source separation and dereverberation shown in Fig. 1 is discussed. This model estimates the clean spectra of the two speakers through a decomposition of the subband envelope of the mixed reverberated speech signal Y(m,k) into its convolutive components L(m,k), H1(m,k) and F(m,k), H2(m,k), respectively. To achieve this decomposition, a divergence criterion is formulated. In this work, a priori knowledge of the nature of H1(m,k), H2(m
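The divergence criterion itself falls outside this snippet. For illustration only, a standard choice in NMF-based models of this kind is the generalized Kullback–Leibler (I-)divergence between the observed envelope Y and its reconstruction \hat{Y} from the convolutive components, minimized subject to non-negativity; whether the paper uses exactly this form cannot be confirmed from the snippet:

    D(Y \,\|\, \hat{Y}) \;=\; \sum_{m,k} \Big[\, Y(m,k)\,\log\frac{Y(m,k)}{\hat{Y}(m,k)} \;-\; Y(m,k) \;+\; \hat{Y}(m,k) \Big], \qquad L,\,F,\,H_1,\,H_2 \,\ge\, 0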

Incorporating the group delay spectral magnitude in the proposed framework

In this section, the importance of the group delay spectral magnitude is discussed in the context of joint source separation and dereverberation. The high resolution and robustness properties [24] of the group delay spectral magnitude result in smooth and robust subband envelopes. This reduces the error in the NMF decomposition of the observed subband envelope of the mixture into its convolutional components. This is primarily due to the accurate decomposition of the subband envelope computed from GDSM
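As background for readers unfamiliar with the group delay spectrum, the sketch below computes the textbook group delay function, τ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|², where Y is the DFT of n·x[n]. The paper's specific route via the weighted cepstrum is not reproduced here, and the function and variable names are ours, not the paper's:

    import numpy as np

    def group_delay_spectrum(x, n_fft=512):
        # tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, with Y the DFT of n*x[n]
        n = np.arange(len(x))
        X = np.fft.rfft(x, n_fft)
        Y = np.fft.rfft(n * x, n_fft)
        eps = 1e-10  # guards against divide-by-zero at spectral nulls
        return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

    # The group delay spectral *magnitude* from which subband envelopes
    # would be derived is then simply the absolute value:
    gdsm = np.abs(group_delay_spectrum(np.random.randn(400)))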

Algorithm for joint source separation and dereverberation

The block diagram of the proposed joint source separation and dereverberation algorithm is illustrated in Fig. 6. The algorithmic steps involved in joint source separation and dereverberation are detailed in Algorithm 1. The spectrographic analysis of the proposed method is explained herein.
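Algorithm 1 itself is not reproduced in this snippet. As a rough, self-contained illustration of the core iteration described in the text — decomposing the observed envelope into two convolutive source components with multiplicative, non-negativity-preserving updates, then taking the square root to recover spectral magnitudes — the toy sketch below assumes a generalized KL divergence, holds the RIR envelopes fixed rather than updating them jointly, and uses synthetic data; none of the names or choices here are the paper's. The modified spectral-subtraction stage for late reverberation is also omitted.

    import numpy as np

    def conv_t(S, H):
        # Per-subband convolution along time: out(m,k) = sum_tau S(m-tau,k) H(tau,k)
        out = np.zeros_like(S)
        for tau in range(H.shape[0]):
            out[tau:] += H[tau] * S[:S.shape[0] - tau]
        return out

    def mult_update(S, H, V, Lam, eps=1e-10):
        # KL-divergence multiplicative update; stays non-negative by construction
        R = V / (Lam + eps)
        num = np.zeros_like(S)
        for tau in range(H.shape[0]):
            num[:S.shape[0] - tau] += H[tau] * R[tau:]
        return S * num / (H.sum(axis=0) + eps)

    rng = np.random.default_rng(0)
    M, K, T = 100, 64, 8                            # frames, subbands, RIR envelope length
    L_true, F_true = rng.random((M, K)), rng.random((M, K))
    H1, H2 = rng.random((T, K)), rng.random((T, K))
    V = conv_t(L_true, H1) + conv_t(F_true, H2)     # observed mixture envelope (toy, noise-free)

    L, F = np.ones((M, K)), np.ones((M, K))         # non-negative initial estimates
    for _ in range(50):
        Lam = conv_t(L, H1) + conv_t(F, H2)
        L = mult_update(L, H1, V, Lam)
        Lam = conv_t(L, H1) + conv_t(F, H2)
        F = mult_update(F, H2, V, Lam)

    mag_L, mag_F = np.sqrt(L), np.sqrt(F)           # envelope -> spectral magnitude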

Performance evaluation

In this section, experiments on source separation, dereverberation, and distant speech recognition are presented. The performance of source separation is evaluated in terms of subjective, objective, and target-to-interference ratio (TIR) measures. The reconstructed target signal obtained from the proposed GDSM method is compared with other separation methods at various TIRs. Additionally, experiments are conducted to evaluate the quality of speech dereverberation using objective
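The TIR itself is not defined within this snippet; by analogy with the signal-to-interference ratios standard in the separation literature, it presumably takes the form below (an assumption, not a quotation from the paper):

    \mathrm{TIR} \;=\; 10\,\log_{10} \frac{\sum_n s_{\mathrm{target}}^2(n)}{\sum_n s_{\mathrm{interference}}^2(n)} \;\; \text{dB}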

Conclusions

A method for performing joint source separation and dereverberation by minimizing the divergence between the observed and true subband envelopes obtained from the group delay spectral magnitude (GDSM) is proposed in this work. Advantages of the GDSM include robustness to noise and reverberation when compared to the FFT spectral magnitude. Due to the high resolution property of the group delay spectral magnitude, this method reduces the error in the decomposition of the mixed signal into its convolutional

Acknowledgment

This work was supported and funded by the MIPS Lab, IIT Kanpur.

References (61)

  • D. Bees, M. Blostein, P. Kabal, Reverberant speech enhancement using cepstral processing, in: 1991 International...
  • J. Benesty et al.

    Speech Enhancement

    (2005)
  • B. Bozkurt, Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source...
  • ITU-R BS.1534-1, Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems, vol. 14....
  • J. Campbell

Speaker recognition: a tutorial

    Proc. IEEE

    (1997)
  • M. Cooke et al.

    An audio-visual corpus for speech perception and automatic speech recognition

    J. Acoust. Soc. Am.

    (2006)
  • J. Damaschke, R. Huber, V. Hohmann, B. Kollmeier, PRO-DASP: an audio quality testbench for optimizing low-power chip...
  • B. Dumortier, E. Vincent, et al., Blind RT60 estimation robust across room sizes and source distances, in: 2014 IEEE...
  • V. Emiya et al.

    Subjective and objective quality assessment of audio source separation

    IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • J.S. Erkelens et al.

    Correlation-based and model-based blind single-channel late-reverberation suppression in noisy time-varying acoustical environments

    IEEE Trans. Audio Speech Lang. Process.

    (2010)
  • K. Furuya, S. Sakauchi, A. Kataoka, Speech dereverberation by combining MINT-based blind deconvolution and modified...
  • J. Garofolo

    TIMIT: Acoustic-phonetic Continuous Speech Corpus, Linguistic Data Consortium, LDC93S1

    (1993)
  • D. Gelbart, N. Morgan, Double the trouble: handling noise and reverberation in far-field automatic speech recognition,...
  • E. Grais, H. Erdogan, Single channel speech music separation using nonnegative matrix factorization and spectral masks,...
  • L. Gu, Single-channel speech separation based on instantaneous frequency (Ph.D. thesis), Citeseer,...
  • R. Hegde et al.

    Significance of joint features derived from the modified group delay function in speech processing

    EURASIP J. Audio Speech Music Process.

    (2007)
  • R.M. Heiberger et al.

One-way ANOVA, in: R Through Excel

    (2009)
  • T. Houtgast et al.

A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria

    J. Acoust. Soc. Am.

    (1985)
  • Y. Hu, P.C. Loizou, Evaluation of objective measures for speech enhancement, in: Interspeech, Citeseer,...