
Evaluating single-channel speech separation performance in transform-domain

Journal of Zhejiang University SCIENCE C

Abstract

Single-channel separation (SCS) is a challenging scenario in which the objective is to segregate the individual speaker signals from their mixture with high accuracy. In this research, a novel framework called subband perceptually weighted transformation (SPWT) is developed to provide a perceptually relevant feature that replaces the commonly used magnitude of the short-time Fourier transform (STFT). The main objectives of the proposed SPWT are to lower the spectral distortion (SD) and to improve the ideal separation quality. The performance of the SPWT is compared with that obtained using the mixmax and Wiener filter methods. A comprehensive statistical analysis is conducted to compare the quantization performance and the ideal separation quality of the SPWT with those of two other features, the log-spectrum and the magnitude spectrum. Our evaluations show that the SPWT provides lower SD values and a more compact SD distribution, leading to more acceptable subjective separation quality as measured by the mean opinion score.
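The abstract does not state the SD formula, so the following is only a minimal sketch of how a spectral distortion figure could be computed. It assumes the widely used root-mean-square log-spectral distortion in dB between a reference and an estimated magnitude spectrum, evaluated frame by frame. The function name `spectral_distortion_db` and the `eps` floor are illustrative choices, not taken from the paper.

```python
import numpy as np


def spectral_distortion_db(ref_mag, est_mag, eps=1e-12):
    """Per-frame RMS log-spectral distortion in dB (assumed definition).

    ref_mag, est_mag: non-negative magnitude spectra with identical shape
    (frames, bins), e.g. STFT magnitudes of a clean and a separated signal.
    Returns one SD value per frame.
    """
    ref_mag = np.asarray(ref_mag, dtype=float)
    est_mag = np.asarray(est_mag, dtype=float)
    if ref_mag.shape != est_mag.shape:
        raise ValueError("ref_mag and est_mag must have the same shape")

    # 20*log10 of a magnitude ratio equals 10*log10 of the power ratio.
    # eps (an assumed small floor) avoids log(0) on empty bins.
    diff_db = 20.0 * np.log10((ref_mag + eps) / (est_mag + eps))

    # Root-mean-square over frequency bins gives one value per frame.
    return np.sqrt(np.mean(diff_db ** 2, axis=-1))


# Usage: summarize SD over all frames of a hypothetical signal pair.
rng = np.random.default_rng(0)
reference = rng.uniform(0.1, 1.0, size=(5, 129))
estimate = reference * rng.uniform(0.8, 1.25, size=reference.shape)
sd = spectral_distortion_db(reference, estimate)
print("mean SD (dB):", sd.mean(), "std of SD (dB):", sd.std())
```

Under this assumed definition, the mean of the per-frame values corresponds to "lower SD values" and their spread to a "more compact distribution of SD", the two properties the abstract reports for the SPWT.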



Author information

Corresponding author

Correspondence to Pejman Mowlaee.

Additional information

A preliminary version of this paper was presented at the 7th ACS/IEEE International Conference on Computer Systems and Applications, Rabat, Morocco, 2009.


About this article

Cite this article

Mowlaee, P., Sayadiyan, A. & Sheikhzadeh, H. Evaluating single-channel speech separation performance in transform-domain. J. Zhejiang Univ. - Sci. C 11, 160–174 (2010). https://doi.org/10.1631/jzus.C0910087


  • DOI: https://doi.org/10.1631/jzus.C0910087
