Evaluating single-channel speech separation performance in transform-domain

Mowlaee, Pejman; Sayadiyan, Abolghasem; Sheikhzadeh, Hamid

doi:10.1631/jzus.C0910087

Published: 07 February 2010

Volume 11, pages 160–174, (2010)

Journal of Zhejiang University SCIENCE C

Abstract

Single-channel separation (SCS) is a challenging scenario in which the objective is to segregate the individual speaker signals from their mixture with high accuracy. In this research, a novel framework called the subband perceptually weighted transformation (SPWT) is developed to offer a perceptually relevant feature that replaces the commonly used magnitude of the short-time Fourier transform (STFT). The main objectives of the proposed SPWT are to lower the spectral distortion (SD) and to improve the ideal separation quality. The performance of the SPWT is compared with that obtained using the mixmax and Wiener filter methods. A comprehensive statistical analysis is conducted to compare the quantization performance and the ideal separation quality of the SPWT with those of the log-spectrum and magnitude-spectrum features. Our evaluations show that the SPWT provides lower SD values and a more compact distribution of SD, leading to more acceptable subjective separation quality as evaluated using the mean opinion score.
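The abstract does not state the exact SD formula, so the following is only a minimal sketch of the common log-spectral distortion measure: the root-mean-square difference, in dB, between a reference and an estimated magnitude spectrum. The function names, the `eps` floor, and the frame-averaging step are assumptions made for illustration and are not taken from the paper.

```python
import math


def log_spectral_distortion(ref_mag, est_mag, eps=1e-12):
    """RMS difference in dB between two magnitude spectra of one frame.

    Assumed definition (common in speech coding), not necessarily the
    one used in the paper.
    """
    if len(ref_mag) != len(est_mag):
        raise ValueError("spectra must have the same number of bins")
    if len(ref_mag) == 0:
        raise ValueError("spectra must be non-empty")
    total = 0.0
    for r, e in zip(ref_mag, est_mag):
        # 20*log10 converts a magnitude to dB; eps guards against log(0).
        diff_db = 20.0 * math.log10(max(r, eps)) - 20.0 * math.log10(max(e, eps))
        total += diff_db ** 2
    return math.sqrt(total / len(ref_mag))


def mean_sd(ref_frames, est_frames):
    """Average the per-frame SD over a sequence of frames."""
    if len(ref_frames) != len(est_frames):
        raise ValueError("frame sequences must have the same length")
    sds = [log_spectral_distortion(r, e) for r, e in zip(ref_frames, est_frames)]
    return sum(sds) / len(sds)
```

Under this definition, an estimate that is uniformly ten times larger than the reference has an SD of 20 dB in every frame, and identical spectra have an SD of 0 dB. A "more compact distribution of SD" would then correspond to a smaller spread of the per-frame values around their mean.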

Author information

Authors and Affiliations

Department of Electronic Engineering, Amirkabir University of Technology, Tehran, 15875-4413, Iran
Pejman Mowlaee, Abolghasem Sayadiyan & Hamid Sheikhzadeh

Corresponding author

Correspondence to Pejman Mowlaee.

Additional information

A preliminary version of this paper was presented at the 7th ACS/IEEE International Conference on Computer Systems and Applications, Rabat, Morocco, 2009.

About this article

Cite this article

Mowlaee, P., Sayadiyan, A. & Sheikhzadeh, H. Evaluating single-channel speech separation performance in transform-domain. J. Zhejiang Univ. - Sci. C 11, 160–174 (2010). https://doi.org/10.1631/jzus.C0910087

Received: 12 February 2009
Accepted: 25 June 2009
Published: 07 February 2010
Issue Date: March 2010
DOI: https://doi.org/10.1631/jzus.C0910087

Key words

CLC number

TN912.3
