Joint source separation and dereverberation using constrained spectral divergence optimization
Introduction
The objective of any source separation method is to recover the original signals from a composite signal. Separation becomes more difficult when the signals are mixed in a reverberant environment. Reverberation occurs when the distance between the speaker and the microphone is large enough that the speech signal reaches the microphone along multiple reflected paths. Reverberation degrades both the intelligibility of the speech signal and speech recognition performance.
Several algorithms have been developed for single-channel speech dereverberation. A temporal averaging method is proposed in Xizhong and Guang [55] to estimate the room acoustic impulse response (AIR) via the complex cepstrum, using an adaptive segmentation technique; the inverse filter is then obtained from the pre-estimated AIR. In Furuya et al. [18], the blind estimation of the inverse filters required to obtain the dereverberated signal is explained; these filters are estimated by computing the correlation matrix between input signals instead of the room impulse response. A two-stage algorithm for a single microphone is proposed in Wu and Wang [54], where an inverse filter is first estimated to reduce coloration effects, and spectral subtraction is then applied as a post-processing step to minimize the influence of long-term reverberation. In Tomar [53], maximizing the kurtosis of the speech residual is proposed for blind dereverberation of the speech signal. A non-negative matrix factorization (NMF) method that performs dereverberation in the gammatone subband magnitude domain is proposed in Kumar et al. [31], where the Fourier transform spectral magnitude is used in an NMF framework for automatic speech recognition (ASR) applications. In Bees et al. [5], dereverberation is carried out by using the cepstrum to determine the acoustic impulse response, which is then used for inverse filtering to estimate the clean speech. The truncation error present in Bees et al. [5] is removed in Xizhong and Guang [55], but inverse filtering is still required. The authors in Kameoka et al. [29] present a blind dereverberation method designed to recover the subband envelope of the original speech signal from its reverberant version; the problem is formulated as blind deconvolution with non-negativity constraints, regularized by the sparse nature of speech spectrograms. In Nakatani et al. [39], a harmonicity-based dereverberation method is discussed to reduce the amount of reverberation in a signal picked up by a single microphone. A variant of spectral subtraction described in Kinoshita et al. [30] utilizes multi-step forward linear prediction for speech dereverberation; it precisely estimates and suppresses late reverberation, which improves ASR performance. All these methods address the speech dereverberation problem in a single-source environment.
Considerable work has also been done to address the source separation problem in an anechoic environment. In the instantaneous frequency method [23], the objective is to extract the target component of speech mixed with interfering speech and to improve the recognition accuracy obtained using the recovered signal; instantaneous frequency is used to reveal the underlying harmonic structures of a complex auditory scene. In latent variable decomposition [45], each magnitude spectral vector of the speech signal is represented as the outcome of a discrete random process. The latent Dirichlet decomposition method [44] generalizes latent variable decomposition by modelling the distribution process as a mixture of multinomial distributions, in which the mixture weights of the component multinomials vary from one analysis window to the next. Non-negative matrix factorization [48], [33], [49] is also an effective method for mixed-speaker separation, decomposing the STFT magnitude matrix [21]. A convolutive version of NMF that takes temporal variations into account for source separation is described in Smaragdis [52]. Single-channel separation of speech and music using the discrete energy separation algorithm (DESA) is discussed in Litvin et al. [34]. Beyond the single channel, multi-channel underdetermined blind source separation in an anechoic environment is discussed in Bofill and Zibulevsky [8] and Niknazar et al. [41]. In Bertrand and Moonen [7], non-negative BSS in a noise-free environment using multiplicative updates and subspace projection is presented.
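As background for the NMF-based separation methods cited above, the following sketch factors a non-negative magnitude matrix with multiplicative updates that minimize the generalized KL divergence, in the spirit of [48], [33], [49]. The function name, matrix sizes, and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=400, seed=0):
    """Factor a non-negative matrix V ~= W @ H by minimizing the
    generalized KL divergence with Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + 1e-3   # non-negative initialization
    H = rng.random((rank, T)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

# Toy "spectrogram" that is exactly rank 4, so the factorization
# should reconstruct it closely.
rng = np.random.default_rng(1)
W_true = rng.random((16, 4))
H_true = rng.random((4, 32))
V = W_true @ H_true
W, H = nmf_multiplicative(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form guarantees that W and H stay non-negative throughout, which is the property the separation methods above rely on.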
In general, source separation and dereverberation have been treated as separate problems, with solutions proposed for each individually, as the discussion above indicates. However, efforts have also been made to address the joint source separation and dereverberation problem. A joint optimization method for multi-channel blind source separation (BSS) and dereverberation is discussed in Yoshioka et al. [60], which optimizes the parameters of the prediction and separation matrices. A BSS framework for noisy and reverberant environments based on a matrix formulation is proposed in Aichner et al. [1]; it allows simultaneous exploitation of the nonwhiteness and nonstationarity of the source signals using second-order statistics. In Xu et al. [56], joint block Toeplitzation and block-inner diagonalization (JBTBID) of a set of correlation matrices of the observed vector sequence is used for convolutive BSS. In Yoshioka et al. [59], the conditional separation and dereverberation (CSD) method for simultaneously achieving blind source separation and dereverberation of sound mixtures is discussed. A tractable BSS framework for estimating and combining spectral source models from noisy source estimates is explained in Arberet et al. [4]. In Rotili et al. [47], a general broadband approach to BSS for convolutive mixtures based on second-order statistics is discussed; an optimum inverse filtering algorithm based on Bezout's theorem is used in the dereverberation stage, which is computationally efficient and allows the inversion of long impulse responses in real-time applications. An integrated method for joint multi-channel blind dereverberation and separation of convolutive audio mixtures is discussed in Yoshioka et al. [58]. All the above methods follow a tandem approach to solving the separation and dereverberation problems in the multi-channel scenario.
Additionally, the above joint blind source separation and dereverberation methods require multi-channel input. This assumption is relaxed in the present work by considering the single-channel case.
The contributions of the paper are as follows. The paper proposes a new model for joint blind source separation and dereverberation in the single-channel, multi-source setting. A different impulse response is considered for each speaker location. Additionally, the proposed method uses the subband envelope of the mixed-speaker signal computed from the group delay spectral magnitude (GDSM) [57], [38] within the NMF framework. Owing to the high-resolution property of the group delay function [57], [24], [3], [10], [9], this reduces the error in decomposing the observed subband envelope (OSE) sequence of the mixed signal into its constituent convolutional components.
In this work, the spectral divergence between the observed subband envelope and the true subband envelope (TSE) is minimized within the NMF framework. The convolutional components, subject to the non-negativity constraint, are updated iteratively. Once the subband envelope updates are obtained for each speaker, the spectral magnitude is recovered by taking the square root of the corresponding subband envelope. Because the NMF procedure runs for a fixed number of iterations, some late reverberation and residual noise remain in the separated spectral magnitude estimates; these remaining components are removed by post-processing. Experiments on source separation and speech dereverberation are conducted on the GRID corpus [13]. The performance of the proposed method indicates reasonable improvements over other conventional methods in the literature. Additionally, distant speech recognition experiments are conducted at various microphone-to-speaker distances to evaluate the effect of distance on recognition performance. The rest of the paper is organized as follows. Section 2 describes the model for source separation under a reverberant environment. In Section 3, the formulation of the source separation problem using constrained spectral divergence is discussed. The significance of the group delay spectral magnitude in the proposed framework is explained in Section 4. The algorithm for joint source separation and dereverberation is presented in Section 5. The performance evaluation of the proposed method is discussed in Section 6. Section 7 presents a brief conclusion.
Section snippets
Model for source separation under a reverberant environment
The system model for source separation under a reverberant environment is formulated herein. Fig. 1 illustrates the model for reverberation of two sources mixed at a single microphone under noise. Let the subband envelopes of the two speaker signals be denoted by s1(m, k) and s2(m, k), where m is the frame index and k corresponds to the frequency bin index; K is the total number of subbands in each frame. The subband envelope of the room impulse response (RIR) associated with each speaker signal is …
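The mixing model described above, in which each speaker's subband envelope is convolved along the frame axis with its own RIR envelope and the results are summed at the single microphone, can be sketched numerically as follows. The function name and the toy array sizes are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def mix_subband_envelopes(s1, s2, h1, h2, noise=None):
    """Observed subband envelope of the mixture: per-subband convolution
    (along the frame axis m) of each speaker envelope with its own RIR
    envelope, summed at the single microphone, plus optional noise."""
    K, M = s1.shape          # K subbands, M frames
    L = h1.shape[1]          # RIR envelope length in frames
    y = np.zeros((K, M + L - 1))
    for k in range(K):       # independent 1-D convolution per subband
        y[k] = np.convolve(s1[k], h1[k]) + np.convolve(s2[k], h2[k])
    if noise is not None:
        y += noise
    return y

# Toy example: 3 subbands, 5 frames, 2-frame RIR envelopes.
rng = np.random.default_rng(0)
s1, s2 = rng.random((3, 5)), rng.random((3, 5))
h1, h2 = rng.random((3, 2)), rng.random((3, 2))
y = mix_subband_envelopes(s1, s2, h1, h2)
```

Because envelopes and RIR envelopes are non-negative, the observed envelope y is non-negative as well, which is what makes the NMF-style decomposition in the following sections applicable.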
Formulation of source separation problem using constrained spectral divergence optimization
In this section, the joint source separation and dereverberation model in the subband envelope domain is discussed, as shown in Fig. 1. The model estimates the clean spectra of the two speakers by decomposing the subband envelope of the mixed reverberated speech signal into its convolutive components (the speaker envelopes and the corresponding RIR envelopes). To achieve this decomposition, a divergence criterion is formulated. In this work, a priori knowledge of the nature of …
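The snippet cuts off before the criterion itself is stated, so as a hedged illustration, one standard choice of spectral divergence between an observed envelope Y and a modelled envelope Ŷ is the generalized KL divergence, which is non-negative and vanishes only when the two envelopes match; the paper's exact criterion may differ in detail.

```python
import numpy as np

def spectral_divergence(Y, Y_hat, eps=1e-12):
    """Generalized KL divergence D(Y || Y_hat) between two non-negative
    envelope matrices: sum of Y*log(Y/Y_hat) - Y + Y_hat."""
    Y = np.asarray(Y, dtype=float) + eps
    Y_hat = np.asarray(Y_hat, dtype=float) + eps
    return float(np.sum(Y * np.log(Y / Y_hat) - Y + Y_hat))

Y = np.array([[1.0, 2.0], [0.5, 1.5]])
d_same = spectral_divergence(Y, Y)        # matching envelopes -> 0
d_off = spectral_divergence(Y, 2.0 * Y)   # mismatched -> strictly positive
```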
Incorporating the group delay spectral magnitude in the proposed framework
In this section, the importance of the group delay spectral magnitude is discussed in the context of joint source separation and dereverberation. The high-resolution and robustness properties [24] of the group delay spectral magnitude result in smooth and robust subband envelopes. This reduces the error in the NMF decomposition of the observed subband envelope of the mixture into its convolutional components, primarily owing to the accurate decomposition of the subband envelope computed from the GDSM …
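The group delay function underlying the GDSM can be computed without explicit phase unwrapping, using the standard identity involving the DFTs of x(n) and n·x(n). The sketch below shows only this generic computation; the paper's full GDSM pipeline (e.g. any modified group delay smoothing per [38]) is not reproduced here.

```python
import numpy as np

def group_delay(frame, nfft=512):
    """Group delay tau(w) = -d(arg X(w))/dw of one windowed frame,
    computed as (X_R*Y_R + X_I*Y_I) / |X|^2, where Y is the DFT of
    n*x(n). This avoids unwrapping the phase spectrum."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)
    Y = np.fft.rfft(n * frame, nfft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)

# Sanity check: a pure delay of d samples has constant group delay d.
x = np.zeros(64)
x[3] = 1.0
tau = group_delay(x, nfft=256)
```

For the impulse delayed by 3 samples, the phase spectrum is linear and the group delay is flat at 3 frames across all bins, which is easy to verify numerically.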
Algorithm for joint source separation and dereverberation
The block diagram of the proposed joint source separation and dereverberation algorithm is illustrated in Fig. 6. The algorithmic steps involved in joint source separation and dereverberation are detailed in Algorithm 1. The spectrographic analysis of the proposed method is explained herein.
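Since Algorithm 1 itself is not reproduced in this snippet, the following is a hypothetical sketch of the kind of iterative core it describes: alternating multiplicative updates that decompose the observed subband envelope into per-speaker convolutive components while the KL-type divergence is monitored. All names, sizes, and the exact update ordering are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

EPS = 1e-12

def kl_div(Y, Y_hat):
    # generalized KL divergence between non-negative envelopes
    return float(np.sum(Y * np.log((Y + EPS) / (Y_hat + EPS)) - Y + Y_hat))

def model(S, H):
    """Modelled mixture envelope: sum over sources of per-subband
    convolution of speaker envelope S[i][k] with RIR envelope H[i][k]."""
    K, M, L = S[0].shape[0], S[0].shape[1], H[0].shape[1]
    Y_hat = np.zeros((K, M + L - 1))
    for s, h in zip(S, H):
        for k in range(K):
            Y_hat[k] += np.convolve(s[k], h[k])
    return Y_hat

def joint_updates(Y, S, H, n_iter=150):
    """Alternating multiplicative updates for speaker envelopes S[i] and
    RIR envelopes H[i]; non-negativity is preserved automatically."""
    history = [kl_div(Y, model(S, H))]
    for _ in range(n_iter):
        R = (Y + EPS) / (model(S, H) + EPS)       # ratio term of the KL gradient
        for s, h in zip(S, H):                    # update speaker envelopes
            for k in range(Y.shape[0]):
                s[k] *= np.correlate(R[k], h[k], 'valid') / (h[k].sum() + EPS)
        R = (Y + EPS) / (model(S, H) + EPS)
        for s, h in zip(S, H):                    # update RIR envelopes
            for k in range(Y.shape[0]):
                h[k] *= np.correlate(R[k], s[k], 'valid') / (s[k].sum() + EPS)
        history.append(kl_div(Y, model(S, H)))
    return S, H, history

# Synthetic mixture generated by the model itself, then re-estimated
# from random non-negative initial guesses.
rng = np.random.default_rng(0)
K, M, L = 4, 20, 3
S_true = [rng.random((K, M)) for _ in range(2)]
H_true = [rng.random((K, L)) for _ in range(2)]
Y = model(S_true, H_true)
S0 = [rng.random((K, M)) + 0.1 for _ in range(2)]
H0 = [rng.random((K, L)) + 0.1 for _ in range(2)]
S_est, H_est, hist = joint_updates(Y, S0, H0)
# Per Section 5's description, the separated spectral magnitudes would
# then follow as np.sqrt(S_est[i]) for each speaker.
```

The divergence recorded in `hist` should fall over the run, reflecting that the decomposition is fitting the observed envelope; blind deconvolution retains the usual scaling/permutation ambiguities, so exact recovery of the true components is not expected.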
Performance evaluation
In this section, experiments on source separation, dereverberation and distant speech recognition are evaluated. Source separation performance is evaluated in terms of subjective, objective and target-to-interference ratio (TIR) measures. The reconstructed target signal obtained from the proposed GDSM method is compared with other separation methods at various TIRs. Additionally, experiments are conducted to evaluate the quality of speech dereverberation using objective measures …
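Under its usual definition (assumed here, since the snippet does not define it), the TIR mentioned above is the energy ratio of the target signal to the interfering signal in decibels:

```python
import numpy as np

def target_to_interference_ratio(target, interference):
    """TIR in dB: 10*log10 of target energy over interference energy."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))

t = np.ones(100)          # unit-amplitude "target"
i = 0.5 * np.ones(100)    # half-amplitude "interference"
tir = target_to_interference_ratio(t, i)   # energy ratio 4 -> ~6.02 dB
```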
Conclusions
A method for performing joint source separation and dereverberation by minimizing the divergence between the observed and true subband envelopes obtained from the group delay spectral magnitude (GDSM) is proposed in this work. Advantages of the GDSM include robustness to noise and reverberation compared with the FFT spectral magnitude. Owing to the high-resolution property of the group delay spectral magnitude, the method reduces the error in decomposing the mixed signal into its convolutional components …
Acknowledgment
This work was supported and funded by the MIPS Lab, IIT Kanpur.
References (61)
- et al., A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments, Signal Process. (2006)
- et al., A tractable framework for estimating and combining spectral source models for audio source separation, Signal Process. (2012)
- et al., Blind separation of non-negative source signals using multiplicative updates and subspace projection, Signal Process. (2010)
- et al., Underdetermined blind source separation using sparse representations, Signal Process. (2001)
- et al., Chirp group delay analysis of speech signals, Speech Commun. (2007)
- et al., Monaural speech/music source separation using discrete energy separation algorithm, Signal Process. (2010)
- et al., Blind source separation of underdetermined mixtures of event-related sources, Signal Process. (2014)
- et al., Convolutive blind source separation based on joint block Toeplitzation and block-inner diagonalization, Signal Process. (2010)
- et al., Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. (1979)
- M. Anand Joseph, S. Guruprasad, B. Yegnanarayana, Extracting formants from short segments of speech using group delay...
- Speech Enhancement
- Speaker recognition: a tutorial, Proc. IEEE
- An audio-visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Am.
- Subjective and objective quality assessment of audio source separation, IEEE Trans. Audio Speech Lang. Process.
- Correlation-based and model-based blind single-channel late-reverberation suppression in noisy time-varying acoustical environments, IEEE Trans. Audio Speech Lang. Process.
- TIMIT: Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, LDC93S1
- Significance of joint features derived from the modified group delay function in speech processing, EURASIP J. Audio Speech Music Process.
- One-way ANOVA, in: R Through Excel
- A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria, J. Acoust. Soc. Am.