
Separation of Multiple Speech Sources in Reverberant Environments Based on Sparse Component Enhancement

Published in Circuits, Systems, and Signal Processing.

Abstract

Multiple speech source separation plays an important role in many applications, such as automatic speech recognition, acoustic surveillance, and teleconferencing. In this study, we propose a method for separating multiple speech sources in a reverberant environment based on sparse component enhancement. In a recorded signal (i.e., a mixture of multiple speech sources), there are always time–frequency points at which only one source is active or dominant; this property is known as the sparsity of speech signals, and such time–frequency points are called sparse component points. In a reverberant environment, however, the sparsity of the speech signal is degraded, reducing the number of sparse component points in the recorded signal and thereby lowering the quality of the separated source signals. In this study, for mixture signals recorded by a soundfield microphone (a microphone array), we first experimentally analyze the negative impact of reverberation on sparse components and then develop a sparse component enhancement method to increase the number of these points. The sparse components are then identified and classified according to direction-of-arrival (DOA) estimates of the sources. Next, the sparse components are used to guide the recovery of the non-sparse components. Finally, multiple-source separation is achieved by jointly restoring the sparse and non-sparse components of each source. The proposed method has low computational complexity and applies to underdetermined scenarios. Its effectiveness is verified through a series of subjective and objective evaluation experiments.
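The core idea of sparse-component separation can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a generic time–frequency masking example in which the two sources are rarely active at the same time–frequency point, each point is classified to one source by a channel level-ratio test (a stand-in for the DOA-based classification described above), and each source is recovered by binary masking. All signals, gains, and parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

# Two synthetic "sources" that occupy disjoint frequency regions, so almost
# every time-frequency point is dominated by a single source (sparsity).
fs, n = 8000, 16000
t = np.arange(n) / fs
s1 = np.sin(2 * np.pi * 500 * t)      # source 1: low tone
s2 = np.sin(2 * np.pi * 1800 * t)     # source 2: high tone

# Two-channel mixture with source-dependent gains (illustrative values).
x1 = s1 + s2
x2 = 0.5 * s1 + 2.0 * s2

f, tt, X1 = stft(x1, fs=fs, nperseg=512)
_, _, X2 = stft(x2, fs=fs, nperseg=512)

# At each time-frequency point, the channel-2/channel-1 level ratio clusters
# near the dominant source's true gain (0.5 or 2.0). Classifying by the
# geometric midpoint 1.0 plays the role of grouping sparse component points.
ratio = np.abs(X2) / (np.abs(X1) + 1e-12)
mask1 = ratio < 1.0        # points assigned to source 1
mask2 = ~mask1             # remaining points assigned to source 2

# Binary masking of channel 1, followed by the inverse STFT, recovers
# an estimate of each source.
_, y1 = istft(X1 * mask1, fs=fs, nperseg=512)
_, y2 = istft(X1 * mask2, fs=fs, nperseg=512)

# Correlation with the original sources should be high for both estimates.
corr1 = np.corrcoef(y1[:n], s1)[0, 1]
corr2 = np.corrcoef(y2[:n], s2)[0, 1]
```

In the paper's setting the classification uses DOA estimates from a soundfield microphone rather than a level ratio, and reverberation smears energy across time–frequency points, which is why the proposed enhancement step is needed before such masking can work well.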


Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61971015, the Beijing Natural Science Foundation under Grant No. L223033, and the Cooperative Research Project of BJUT-NTUT under Grant No. NTUT-BJUT-110-05.

Author information

Correspondence to Maoshen Jia.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, L., Jia, M., Liu, J. et al. Separation of Multiple Speech Sources in Reverberant Environments Based on Sparse Component Enhancement. Circuits Syst Signal Process 42, 6001–6028 (2023). https://doi.org/10.1007/s00034-023-02383-6

