Co-channel Speech Separation Based on Amplitude Modulation Spectrum Analysis

Hu, Qi; Liang, Man-Gui

doi:10.1007/s00034-013-9656-6

Published: 25 September 2013

Circuits, Systems, and Signal Processing, Volume 33, pages 565–588 (2014)

Qi Hu¹ &
Man-Gui Liang¹

Abstract

Considerable effort has been devoted to co-channel (two-talker) speech separation, but comprehensive analysis of the amplitude modulation spectrum (AMS) for this problem has received little attention. In this paper, we propose an approach that exploits the AMS and performs the separation within the framework of computational auditory scene analysis (CASA). Specifically, the method uses the periodicity encoded in the AMS to make the channel selection. The main features of the approach are: (1) the reassignment method is used to improve the spectral resolution of the AMS over short durations; (2) a template-based pitch detector is used to determine the dominant fundamental frequency (F0) in each individual channel; (3) segmentation and grouping, the two stages of CASA-based approaches, are employed to increase the robustness of channel selection. Systematic evaluation and comparison show that the proposed approach yields better performance than the previous system.
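As a rough illustration of the periodicity cue mentioned in the abstract, the following minimal Python sketch computes an amplitude modulation spectrum for a single channel and reads off its strongest component in the pitch range. It is not the authors' implementation: the function name `amplitude_modulation_spectrum`, the half-wave-rectified envelope, the Hann-windowed FFT (used here in place of the reassignment method), and the toy two-harmonic signal are all assumptions made for illustration.

```python
import numpy as np

def amplitude_modulation_spectrum(channel, fs):
    """Magnitude spectrum of the envelope of one band-limited channel signal.

    Assumptions (not taken from the paper): the envelope is obtained by
    half-wave rectification, and a plain Hann-windowed FFT is used instead
    of the reassignment method.
    """
    envelope = np.maximum(channel, 0.0)       # half-wave rectification
    envelope = envelope - envelope.mean()     # suppress the DC component
    window = np.hanning(len(envelope))
    spectrum = np.abs(np.fft.rfft(envelope * window))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    return freqs, spectrum

# Toy channel: two adjacent harmonics (1400 and 1600 Hz) of a 200 Hz voice.
fs = 16000
n = 1280                                      # one 80 ms frame
t = np.arange(n) / fs
channel = np.sin(2 * np.pi * 1400 * t) + np.sin(2 * np.pi * 1600 * t)

freqs, ams = amplitude_modulation_spectrum(channel, fs)

# Strongest AMS component inside a plausible pitch range of 100 to 400 Hz.
in_range = (freqs >= 100.0) & (freqs <= 400.0)
peak_hz = freqs[in_range][np.argmax(ams[in_range])]
print(f"strongest modulation component: {peak_hz:.1f} Hz")
```

The two harmonics beat at their 200 Hz spacing, so the envelope spectrum peaks near the fundamental even though the channel itself contains no energy at 200 Hz. A channel dominated by the other talker would peak near that talker's F0 instead, which is the kind of evidence a channel-selection stage can use.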

Notes

A T-F unit corresponds to a certain filter at a specific time frame.
The ERB of a filter is defined as the bandwidth of an ideal rectangular filter which has a response in its passband equal to the maximum response of the specified filter and transmits the same total power of white noise as the specified filter.
See the next footnote for the definition of the dominant F0 in a unit.
The dominant F0 in T-F unit u(c,m) is defined as the fundamental frequency corresponding to the maximum of R_{c,m}(f) within the plausible pitch range of human speech, i.e., [100, 400] Hz in this paper.
A correlogram is an autocorrelation of every filter response in an auditory filter bank [11].
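Two of the definitions in these notes translate directly into short computations. The sketch below is illustrative only and makes several assumptions: the function names `equivalent_rectangular_bandwidth` and `dominant_f0` are hypothetical, the responses are sampled on a uniform frequency grid, and the Gaussian-shaped test responses are synthetic stand-ins rather than gammatone filter responses or values of R_{c,m}(f) from the paper.

```python
import numpy as np

def equivalent_rectangular_bandwidth(freqs, power_response):
    """ERB in Hz: area under the power response divided by its peak value.

    A rectangle with this width and the peak height passes the same total
    white-noise power as the filter itself.
    """
    # Trapezoidal integration of the power response over frequency.
    widths = np.diff(freqs)
    area = np.sum(0.5 * (power_response[1:] + power_response[:-1]) * widths)
    return area / np.max(power_response)

def dominant_f0(freqs, response, f_min=100.0, f_max=400.0):
    """Frequency at which `response` is largest inside [f_min, f_max] Hz."""
    mask = (freqs >= f_min) & (freqs <= f_max)
    return freqs[mask][np.argmax(response[mask])]

freqs = np.arange(0.0, 1001.0, 1.0)           # 1 Hz grid (assumed)

# Synthetic filter with a Gaussian power response, sigma = 20 Hz.
sigma = 20.0
power = np.exp(-((freqs - 500.0) ** 2) / (2.0 * sigma ** 2))
erb = equivalent_rectangular_bandwidth(freqs, power)

# Synthetic response with a large peak at 50 Hz (outside the pitch range)
# and a smaller one at 220 Hz (inside it).
response = (2.0 * np.exp(-((freqs - 50.0) ** 2) / 50.0)
            + np.exp(-((freqs - 220.0) ** 2) / 50.0))
f0 = dominant_f0(freqs, response)
print(erb, f0)
```

For the Gaussian test response the ERB comes out close to sigma times the square root of 2π (about 50 Hz here), and the range restriction makes `dominant_f0` ignore the larger out-of-range peak and return the 220 Hz one.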

References

F. Auger, P. Flandrin, Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Trans. Signal Process. 43(5), 1068–1089 (1995)
A. Bregman, Auditory Scene Analysis (MIT Press, Cambridge, 1990)
G. Brown, M. Cooke, Computational auditory scene analysis. Comput. Speech Lang. 8, 297–336 (1994)
P. Boersma, D. Weenink, Praat: doing phonetics by computer (Version 4.3.14) (2005). Accessed 21 August 2013. See http://www.fon.hum.uva.nl/praat
T. Dau, D. Puschel, A. Kohlrausch, A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996)
D. Ellis, Prediction-driven computational auditory scene analysis. Ph.D. Dissertation, Mass. Inst. of Technol, Cambridge, MA (1996)
C. Fevotte, S. Godsill, A Bayesian approach for blind separation of sparse sources. IEEE Trans. Audio Speech Lang. Process. 14(6), 2174–2188 (2006)
P. Flandrin, F. Auger, E. Chassande-Mottin, Time-frequency reassignment: from principles to algorithms, in Applications in Time-Frequency Signal Processing, ed. by P. Antonia (CRC Press, Boca Raton, 2003), pp. 179–203
J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, DARPA TIMIT acoustic-phonetic continuous speech corpus. Technical Report NISTIR 4930, National Inst. of Standards and Technol, Gaithersburg, MD (1993)
D. Hermes, Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. 83(1), 257–264 (1988)
G. Hu, D. Wang, Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw. 15(5), 1135–1150 (2004)
G. Hu, D. Wang, A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio Speech Lang. Process. 18(8), 2067–2079 (2010)
G. Hu, D. Wang, Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio Speech Lang. Process. 15(2), 396–405 (2007)
J. Hershey, S. Rennie, P. Olsen, T. Kristjansson, Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang. 24, 45–66 (2010)
J. Holdsworth, I. Nimmo-Smith, R. Patterson, P. Rice, Implementing a gammatone filter bank. MRC Appl. Psychology Unit Rep. (1988)
K. Han, D. Wang, A classification based approach to speech segregation. J. Acoust. Soc. Am. 132(5), 3475–3483 (2012)
K. Hu, D. Wang, An unsupervised approach to cochannel speech separation. IEEE Trans. Audio Speech Lang. Process. 21(1), 122–131 (2013)
G. Jang, T. Lee, A probabilistic approach to single channel blind source separation, in Advances in Neural Inf. Process. Syst. (2003), pp. 1173–1180
Z. Jin, D. Wang, HMM-based multipitch tracking for noisy and reverberant speech. IEEE Trans. Audio Speech Lang. Process. 19(5), 1091–1102 (2011)
A. Klapuri, Signal processing methods for the automatic transcription of music. Ph.D. Dissertation, Tampere University of Technol, Finland (2004)
A. Klapuri, Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Trans. Audio Speech Lang. Process. 16(2), 255–266 (2008)
B. Kollmeier, R. Koch, Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction. J. Acoust. Soc. Am. 95(3), 1593–1602 (1994)
G. Kim, Y. Lu, Y. Hu, P. Loizou, An algorithm that improves speech intelligibility in noise for normal-hearing listeners. J. Acoust. Soc. Am. 129(3), 1486–1494 (2009)
M. Lewicki, T. Sejnowski, Learning nonlinear overcomplete representations for efficient coding, in Advances in Neural Inf. Process. Syst., ed. by M. Jordan, M. Kearns, S. Solla (MIT Press, Cambridge, 1998)
N. Li, P. Loizou, Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123(3), 1673–1682 (2008)
P. Li, Y. Guan, B. Xu, W. Liu, Monaural speech separation based on computational auditory scene analysis and objective quality assessment of speech. IEEE Trans. Audio Speech Lang. Process. 14(6), 2014–2023 (2006)
Y. Li, D. Wang, On the optimality of ideal binary time frequency masks. Speech Commun. 51, 230–239 (2009)
A. Mahmoodzadeh, H. Abutalebi, H. Soltanian-Zadeh, H. Sheikhzadeh, Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method. EURASIP J. Adv. Signal Process. (2012). doi:10.1186/1687-6180-2012-67
B. Moore, An Introduction to the Psychology of Hearing (Academic Press, San Diego, 2003)
P. Mowlaee, M. Christensen, S. Jensen, New results on single-channel speech separation using sinusoidal modeling. IEEE Trans. Audio Speech Lang. Process. 19(5), 1265–1277 (2011)
A. Noll, Cepstrum pitch determination. J. Acoust. Soc. Am. 41(2), 293–309 (1967)
F. Plante, G. Meyer, W. Ainsworth, Improvement of speech spectrogram accuracy by the method of reassignment. IEEE Trans. Speech Audio Process. 6(3), 282–286 (1998)
R. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, An efficient auditory filterbank based on the gammatone function. MRC Appl. Psychology Unit Rep. pp. 1–33 (1988)
A. Reddy, B. Raj, Soft mask methods for single-channel speaker separation. IEEE Trans. Audio Speech Lang. Process. 15(6), 1766–1776 (2007)
M. Radfar, R. Dansereau, Single-channel speech separation using soft mask filtering. IEEE Trans. Audio Speech Lang. Process. 15(8), 2299–2310 (2007)
S. Roweis, Factorial models and refiltering for speech separation and denoising, in Proc. ISCA European Conference Speech Communication and Technology (Eurospeech) (2003), pp. 1009–1012
A. Shapiro, C. Wang, A versatile pitch tracking algorithm: from human speech to killer whale vocalizations. J. Acoust. Soc. Am. 126(1), 451–459 (2009)
H. Strube, H. Wilmers, Noise reduction for speech signals by operation on the modulation frequency spectrum. J. Acoust. Soc. Am. 105(2), 1092 (1999)
M. Stark, M. Wohlmayr, F. Pernkopf, Source-filter-based single-channel speech separation using pitch information. IEEE Trans. Audio Speech Lang. Process. 19(2), 242–255 (2011)
M. Schmidt, R. Olsson, Single-channel speech separation using sparse non-negative matrix factorization, in Proc. of the International Conference on Spoken Lang. Process. (ICSLP) (2006), pp. 2614–2617
S. Schimmel, K. Fitz, L. Atlas, Frequency reassignment for coherent modulation filtering, in Proceedings of the International Conference on Acoust., Speech, and Signal Process (ICASSP) (2006), pp. 261–264
J. Tchorz, B. Kollmeier, SNR estimation based on amplitude modulation analysis with applications to noise suppression. IEEE Trans. Speech Audio Process. 11(3), 184–192 (2001)
C. Wang, S. Seneff, Robust pitch tracking for prosodic modeling in telephone speech, in Proceedings of the International Conference on Acoust., Speech, and Signal Process (ICASSP), Istanbul, Turkey (2000), pp. 1143–1146
D. Wang, G. Brown (eds.), Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley/IEEE Press, New York, 2006)
D. Wang, U. Kjems, M. Pedersen, J. Boldt, Speech perception of noise with binary gains. J. Acoust. Soc. Am. 125(4), 2303–2307 (2008)
D. Wang, On ideal binary masks as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines, ed. by P. Divenyi (Kluwer Academic, Boston, 2005), pp. 181–197
M. Weintraub, A theory and computational model of auditory monaural sound separation. Ph.D. Dissertation, Stanford University, CA, USA (1985)
R. Weiss, D. Ellis, Speech separation using speaker-adapted eigenvoice speech models. Comput. Speech Lang. 24, 16–29 (2010)
D. Yang, G. Meyer, W. Ainsworth, Vowel separation using the reassigned amplitude-modulation spectrum, in Proc. of the International Conference on Spoken Lang. Process. (ICSLP) (1998), p. 0511

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (60772039 and 61202265). The authors would also like to thank the anonymous reviewers for their helpful suggestions/criticisms.

Author information

Authors and Affiliations

Institute of Information Science, Beijing Jiaotong University, Beijing, 100044, P.R. China
Qi Hu & Man-Gui Liang

Corresponding author

Correspondence to Qi Hu.

About this article

Cite this article

Hu, Q., Liang, MG. Co-channel Speech Separation Based on Amplitude Modulation Spectrum Analysis. Circuits Syst Signal Process 33, 565–588 (2014). https://doi.org/10.1007/s00034-013-9656-6

Received: 18 January 2013
Revised: 22 August 2013
Published: 25 September 2013
Issue Date: February 2014
DOI: https://doi.org/10.1007/s00034-013-9656-6
