
Co-channel Speech Separation Based on Amplitude Modulation Spectrum Analysis

Published in Circuits, Systems, and Signal Processing.

Abstract

Considerable effort has been devoted to co-channel (two-talker) speech separation. However, a comprehensive analysis of the amplitude modulation spectrum (AMS) for this problem has received little attention. In this paper, we propose an approach that exploits the AMS and performs the separation within the framework of computational auditory scene analysis (CASA). Specifically, the method utilizes the periodicity encoded in the AMS and then performs channel selection. The main features of the approach are: (1) the reassignment method is used to improve the spectral resolution of the AMS over short durations; (2) a template-based pitch detector is used to determine the dominant fundamental frequency (F0) in each channel; (3) segmentation and grouping, the two stages of CASA-based approaches, are employed to increase the robustness of channel selection. Systematic evaluation and comparison show that the proposed approach outperforms a previous system.
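
As a rough illustration of the idea behind the AMS, the sketch below computes a per-frame magnitude spectrum of a crude signal envelope. It deliberately omits the paper's actual front end (gammatone filtering and the reassignment method); half-wave rectification stands in for envelope extraction, and all names and parameters are illustrative only.

```python
import numpy as np

def amplitude_modulation_spectrum(x, frame_len=1024, n_fft=2048):
    """Toy AMS: per-frame magnitude spectrum of a crude signal envelope.

    Sketch only -- the paper's gammatone filterbank and reassignment
    step are not reproduced here.
    """
    env = np.maximum(x, 0.0)                        # crude envelope
    n_frames = len(env) // frame_len
    ams = np.empty((n_frames, n_fft // 2 + 1))
    for m in range(n_frames):
        frame = env[m * frame_len:(m + 1) * frame_len] * np.hanning(frame_len)
        ams[m] = np.abs(np.fft.rfft(frame, n_fft))  # modulation magnitudes
    return ams

# A 1 kHz carrier modulated at 150 Hz should show a modulation peak near 150 Hz.
fs = 8000
t = np.arange(fs) / fs
x = (1.0 + np.cos(2 * np.pi * 150 * t)) * np.cos(2 * np.pi * 1000 * t)
ams = amplitude_modulation_spectrum(x)
mod_freqs = np.fft.rfftfreq(2048, d=1.0 / fs)
band = (mod_freqs >= 100) & (mod_freqs <= 400)
peak_hz = mod_freqs[band][np.argmax(ams[0][band])]
```

The dominant modulation component inside the plausible pitch band falls near the 150 Hz modulation rate, which is the kind of periodicity cue the channel-selection stage relies on.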



Notes

  1. A T-F unit corresponds to a certain filter at a specific time frame.

  2. The ERB of a filter is defined as the bandwidth of an ideal rectangular filter which has a response in its passband equal to the maximum response of the specified filter and transmits the same total power of white noise as the specified filter.

  3. See the next footnote for the definition of the dominant F0 in a unit.

  4. The dominant F0 in T-F unit u(c,m) is defined as the fundamental frequency corresponding to the maximum of R_{c,m}(f) within the plausible pitch range of human speech, i.e., [100, 400] Hz in this paper.

  5. A correlogram is an autocorrelation of every filter response in an auditory filter bank [11].
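
Footnote 2 defines the ERB conceptually. For concreteness, a widely used analytic approximation for auditory filters (the Glasberg and Moore formula, not given in this paper) is:

```python
def erb_glasberg_moore(fc_hz):
    """Glasberg & Moore (1990) ERB approximation for an auditory filter
    centred at fc_hz: ERB = 24.7 * (4.37 * fc / 1000 + 1) Hz."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

erb_1k = erb_glasberg_moore(1000.0)   # about 132.6 Hz at 1 kHz
```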
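
Footnote 4's definition of the dominant F0 amounts to an argmax over the plausible pitch range. A minimal sketch, with a toy response in place of the paper's actual R_{c,m}(f) values:

```python
import numpy as np

def dominant_f0(R, freqs, f_lo=100.0, f_hi=400.0):
    """Dominant F0 of a T-F unit: the frequency maximising R_{c,m}(f)
    within the plausible pitch range [f_lo, f_hi] Hz (footnote 4)."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(freqs[band][np.argmax(R[band])])

freqs = np.arange(0.0, 501.0)                       # 1 Hz candidate grid
R = np.exp(-0.5 * ((freqs - 180.0) / 5.0) ** 2)     # toy response peaking at 180 Hz
R[50] += 2.0                                        # larger peak outside the range
f0 = dominant_f0(R, freqs)                          # 180.0: out-of-range peak ignored
```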
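
Footnote 5's correlogram can be sketched as a per-channel normalised autocorrelation; the single sinusoidal "response" below stands in for a real gammatone filter output:

```python
import numpy as np

def correlogram(responses, max_lag):
    """Normalised autocorrelation of each filter response in the bank,
    one row per channel, for lags 0..max_lag (footnote 5)."""
    A = np.empty((len(responses), max_lag + 1))
    for c, r in enumerate(responses):
        ac = np.correlate(r, r, mode='full')[len(r) - 1:]   # lags 0..len(r)-1
        A[c] = ac[:max_lag + 1] / ac[0]                     # normalise by lag 0
    return A

# A channel dominated by 200 Hz periodicity at fs = 8 kHz should peak at a
# lag of fs / 200 = 40 samples (searching lags >= 20, i.e. F0 <= 400 Hz).
fs = 8000
t = np.arange(800) / fs
A = correlogram([np.cos(2 * np.pi * 200 * t)], 100)
best_lag = 20 + int(np.argmax(A[0][20:]))
```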

References

  1. F. Auger, P. Flandrin, Improving the readability of time-frequency and time-scale representations by the reassignment method. IEEE Trans. Signal Process. 43(5), 1068–1089 (1995)

  2. A. Bregman, Auditory Scene Analysis (MIT Press, Cambridge, 1990)

  3. G. Brown, M. Cooke, Computational auditory scene analysis. Comput. Speech Lang. 8, 297–336 (1994)

  4. P. Boersma, D. Weenink, Praat: doing phonetics by computer (Version 4.3.14) (2005). Accessed 21 August 2013. See http://www.fon.hum.uva.nl/praat

  5. T. Dau, D. Puschel, A. Kohlrausch, A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6), 3615–3622 (1996)

  6. D. Ellis, Prediction-driven computational auditory scene analysis. Ph.D. Dissertation, Mass. Inst. of Technol, Cambridge, MA (1996)

  7. C. Fevotte, S. Godsill, A Bayesian approach for blind separation of sparse sources. IEEE Trans. Audio Speech Lang. Process. 14(6), 2174–2188 (2006)

  8. P. Flandrin, F. Auger, E. Chassande-Mottin, Time-frequency reassignment: from principles to algorithms, in Applications in Time-Frequency Signal Processing, ed. by P. Antonia (CRC Press, Boca Raton, 2003), pp. 179–203

  9. J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, DARPA TIMIT acoustic-phonetic continuous speech corpus. Technical Report NISTIR 4930, National Inst. of Standards and Technol, Gaithersburg, MD (1993)

  10. D. Hermes, Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am. 83(1), 257–264 (1988)

  11. G. Hu, D. Wang, Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw. 15(5), 1135–1150 (2004)

  12. G. Hu, D. Wang, A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio Speech Lang. Process. 18(8), 2067–2079 (2010)

  13. G. Hu, D. Wang, Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio Speech Lang. Process. 15(2), 396–405 (2007)

  14. J. Hershey, S. Rennie, P. Olsen, T. Kristjansson, Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang. 24, 45–66 (2010)

  15. J. Holdsworth, I. Nimmo-Smith, R. Patterson, P. Rice, Implementing a gammatone filter bank. MRC Appl. Psychology Unit Rep. (1988)

  16. K. Han, D. Wang, A classification based approach to speech segregation. J. Acoust. Soc. Am. 132(5), 3475–3483 (2012)

  17. K. Hu, D. Wang, An unsupervised approach to cochannel speech separation. IEEE Trans. Audio Speech Lang. Process. 21(1), 122–131 (2013)

  18. G. Jang, T. Lee, A probabilistic approach to single channel blind source separation, in Advances in Neural Inf. Process. Syst. (2003), pp. 1173–1180

  19. Z. Jin, D. Wang, HMM-based multipitch tracking for noisy and reverberant speech. IEEE Trans. Audio Speech Lang. Process. 19(5), 1091–1102 (2011)

  20. A. Klapuri, Signal processing methods for the automatic transcription of music. Ph.D. Dissertation, Tampere University of Technol, Finland (2004)

  21. A. Klapuri, Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Trans. Audio Speech Lang. Process. 16(2), 255–266 (2008)

  22. B. Kollmeier, R. Koch, Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction. J. Acoust. Soc. Am. 95(3), 1593–1602 (1994)

  23. G. Kim, Y. Lu, Y. Hu, P. Loizou, An algorithm that improves speech intelligibility in noise for normal-hearing listeners. J. Acoust. Soc. Am. 129(3), 1486–1494 (2009)

  24. M. Lewicki, T. Sejnowski, Learning nonlinear overcomplete representations for efficient coding, in Advances in Neural Inf. Process. Syst., ed. by M. Jordan, M. Kearns, S. Solla (MIT Press, Cambridge, 1998)

  25. N. Li, P. Loizou, Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123(3), 1673–1682 (2008)

  26. P. Li, Y. Guan, B. Xu, W. Liu, Monaural speech separation based on computational auditory scene analysis and objective quality assessment of speech. IEEE Trans. Audio Speech Lang. Process. 14(6), 2014–2023 (2006)

  27. Y. Li, D. Wang, On the optimality of ideal binary time frequency masks. Speech Commun. 51, 230–239 (2009)

  28. A. Mahmoodzadeh, H. Abutalebi, H. Soltanian-Zadeh, H. Sheikhzadeh, Single channel speech separation in modulation frequency domain based on a novel pitch range estimation method. EURASIP J. Adv. Signal Process. (2012). doi:10.1186/1687-6180-2012-67

  29. B. Moore, An Introduction to the Psychology of Hearing (Academic Press, San Diego, 2003)

  30. P. Mowlaee, M. Christensen, S. Jensen, New results on single-channel speech separation using sinusoidal modeling. IEEE Trans. Audio Speech Lang. Process. 19(5), 1265–1277 (2011)

  31. A. Noll, Cepstrum pitch determination. J. Acoust. Soc. Am. 41(2), 293–309 (1967)

  32. F. Plante, G. Meyer, W. Ainsworth, Improvement of speech spectrogram accuracy by the method of reassignment. IEEE Trans. Speech Audio Process. 6(3), 282–286 (1998)

  33. R. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, An efficient auditory filterbank based on the gammatone function. MRC Appl. Psychology Unit Rep. pp. 1–33 (1988)

  34. A. Reddy, B. Raj, Soft mask methods for single-channel speaker separation. IEEE Trans. Audio Speech Lang. Process. 15(6), 1766–1776 (2007)

  35. M. Radfar, R. Dansereau, Single-channel speech separation using soft mask filtering. IEEE Trans. Audio Speech Lang. Process. 15(8), 2299–2310 (2007)

  36. S. Roweis, Factorial models and refiltering for speech separation and denoising, in Proc. ISCA European Conference on Speech Communication and Technology (Eurospeech) (2003), pp. 1009–1012

  37. A. Shapiro, C. Wang, A versatile pitch tracking algorithm: from human speech to killer whale vocalizations. J. Acoust. Soc. Am. 126(1), 451–459 (2009)

  38. H. Strube, H. Wilmers, Noise reduction for speech signals by operation on the modulation frequency spectrum. J. Acoust. Soc. Am. 105(2), 1092 (1999)

  39. M. Stark, M. Wohlmayr, F. Pernkopf, Source-filter-based single-channel speech separation using pitch information. IEEE Trans. Audio Speech Lang. Process. 19(2), 242–255 (2011)

  40. M. Schmidt, R. Olsson, Single-channel speech separation using sparse non-negative matrix factorization, in Proc. of the International Conference on Spoken Lang. Process. (ICSLP) (2006), pp. 2614–2617

  41. S. Schimmel, K. Fitz, L. Atlas, Frequency reassignment for coherent modulation filtering, in Proceedings of the International Conference on Acoust., Speech, and Signal Process (ICASSP) (2006), pp. 261–264

  42. J. Tchorz, B. Kollmeier, SNR estimation based on amplitude modulation analysis with applications to noise suppression. IEEE Trans. Speech Audio Process. 11(3), 184–192 (2001)

  43. C. Wang, S. Seneff, Robust pitch tracking for prosodic modeling in telephone speech, in Proceedings of the International Conference on Acoust., Speech, and Signal Process (ICASSP), Istanbul, Turkey (2000), pp. 1143–1146

  44. D. Wang, G. Brown (eds.), Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley/IEEE Press, New York, 2006)

  45. D. Wang, U. Kjems, M. Pedersen, J. Boldt, Speech perception of noise with binary gains. J. Acoust. Soc. Am. 125(4), 2303–2307 (2008)

  46. D. Wang, On ideal binary masks as the computational goal of auditory scene analysis, in Speech Separation by Humans and Machines, ed. by P. Divenyi (Kluwer Academic, Boston, 2005), pp. 181–197

  47. M. Weintraub, A theory and computational model of auditory monaural sound separation. Ph.D. Dissertation, Stanford University, CA, USA (1985)

  48. R. Weiss, D. Ellis, Speech separation using speaker-adapted eigenvoice speech models. Comput. Speech Lang. 24, 16–29 (2010)

  49. D. Yang, G. Meyer, W. Ainsworth, Vowel separation using the reassigned amplitude-modulation spectrum, in Proc. of the International Conference on Spoken Lang. Process. (ICSLP) (1998), p. 0511

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grants 60772039 and 61202265). The authors also thank the anonymous reviewers for their helpful suggestions and criticisms.

Author information

Correspondence to Qi Hu.

Cite this article

Hu, Q., Liang, MG. Co-channel Speech Separation Based on Amplitude Modulation Spectrum Analysis. Circuits Syst Signal Process 33, 565–588 (2014). https://doi.org/10.1007/s00034-013-9656-6
