Monaural voiced speech segregation based on elaborate harmonic grouping strategies

  • Research Papers
  • Special Focus
Science China Information Sciences

Abstract

This paper proposes an enhanced algorithm for monaural voiced speech segregation based on several elaborate harmonic grouping strategies. Its contributions are threefold. First, the algorithm classifies time-frequency (T-F) units as resolved or unresolved by the carrier-to-envelope energy ratio, which yields more accurate classification than cross-channel correlation. Second, resolved T-F units are grouped according to the minimum amplitude principle, which has been verified to exist in human perception, as well as the harmonic principle. Finally, an “enhanced” envelope autocorrelation function is employed to detect amplitude modulation rates, which substantially reduces half-frequency errors in the grouping of unresolved units. Systematic evaluation and comparison show that the proposed algorithm greatly improves separation performance. Specifically, the signal-to-noise ratio (SNR) is improved by 0.96 dB over the previous method. The algorithm also improves the PESQ score and the subjective perception score.
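To make the reported SNR gain concrete, the following minimal sketch shows one common way to compute a waveform SNR in dB and the gain of one separation result over another. It is an illustration under stated assumptions, not the evaluation code of the paper: the abstract does not give the exact SNR definition, so the usual energy ratio between a reference signal and the residual error is assumed here, and the names `snr_db` and `snr_gain_db` are hypothetical.

```python
import math


def snr_db(reference, estimate):
    """Return the SNR (dB) of `estimate` measured against `reference`.

    Assumed definition (not confirmed by the abstract):
        10 * log10( sum(ref^2) / sum((ref - est)^2) )
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def snr_gain_db(reference, baseline_estimate, proposed_estimate):
    """Return how many dB the proposed estimate gains over the baseline."""
    return snr_db(reference, proposed_estimate) - snr_db(reference, baseline_estimate)


if __name__ == "__main__":
    # Toy signals for illustration only; these are not data from the paper.
    clean = [1.0, -1.0, 1.0, -1.0]
    baseline = [0.8, -0.8, 0.8, -0.8]
    proposed = [0.9, -0.9, 0.9, -0.9]
    print(f"gain over baseline: {snr_gain_db(clean, baseline, proposed):.2f} dB")
```

Under this assumed definition, a gain of 0.96 dB means the proposed output has a higher signal-to-error energy ratio than the baseline output by a factor of about 10^(0.096), measured against the same reference.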

Author information

Corresponding author

Correspondence to WenJu Liu.

Additional information

LIU WenJu was born in 1960. He received the B.S. degree in mathematics from Peking University in 1983, the M.S. degree in mathematics from Beijing University of Posts and Telecommunications in 1989, and the Ph.D. degree in computer applications from Tsinghua University, Beijing, China, in 1993. Currently, he is a research professor at the National Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech recognition, speech synthesis, speaker recognition, keyword spotting, computational auditory scene analysis, speech enhancement, and noise reduction. Dr. Liu Wenju is a member of the Neural Network Committee of China and the Signal Processing Society of the IEEE. He is an editorial board member of the journal Computer Science Application and a reviewer for numerous academic journals, including IEEE Transactions on Audio, Speech, and Language Processing and Cognitive Computation.

JIANG Wei was born in 1982. He received the B.S. degree from Yanshan University in Qinhuangdao, China, in 2005 and the M.S. degree from Harbin Institute of Technology in Harbin, China, in 2008. He is currently working toward the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech segregation, computational auditory scene analysis, and acoustic properties of speech.

ZHANG XueLiang was born in 1981. He received the B.S. degree from Inner Mongolia University in Hohhot, China, in 2003, the M.S. degree from Harbin Institute of Technology in Harbin, China, in 2005, and the Ph.D. degree in Pattern Recognition and Intelligent System from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. Currently, he is a lecturer at the Computer Sciences Department, Inner Mongolia University. His research interests include speech separation, computational auditory scene analysis, and speech signal processing. Dr. Zhang Xueliang is a member of the International Speech Communication Association.

Electronic supplementary material

Supplementary material, approximately 2.75 MB.

About this article

Cite this article

Liu, W., Zhang, X., Jiang, W. et al. Monaural voiced speech segregation based on elaborate harmonic grouping strategies. Sci. China Inf. Sci. 54, 2471–2480 (2011). https://doi.org/10.1007/s11432-011-4506-2
