Improvement in monaural speech separation using sparse non-negative tucker decomposition

Varshney, Yash Vardhan; Upadhyaya, Prashant; Abbasi, Zia Ahmad; Abidi, Musiur Raza; Farooq, Omar

doi:10.1007/s10772-018-9550-5


Published: 05 September 2018

Volume 21, pages 837–849, (2018)

International Journal of Speech Technology

Yash Vardhan Varshney ORCID: orcid.org/0000-0001-9254-8986¹,
Prashant Upadhyaya¹,
Zia Ahmad Abbasi¹,
Musiur Raza Abidi¹ &
Omar Farooq¹


Abstract

This paper introduces a monaural speech separation/enhancement technique based on non-negative Tucker decomposition (NTD). The proposed work incorporates a sparsity regularization factor into the generalized cost function of NTD and examines its effect on the separation of mixed signals. The proposed algorithm exploits the vector components of both the target and the mixed signal and can be used to separate any monaural mixture. Experiments were performed on monaural data generated by mixing speech signals from two speakers, and by mixing noise and speech signals, using the TIMIT and NOISEX-92 datasets. The separation results are compared with those of other existing algorithms in terms of the correlation of the separated signal with the original signal, signal-to-distortion ratio, perceptual evaluation of speech quality, and short-time objective intelligibility. To obtain more conclusive information about separation ability, speech recognition was also performed using the Kaldi toolkit. The recognition results are compared in terms of word error rate (WER) using MFCC-based features. The results show that the average WER improvement of the proposed algorithm over the nearest-performing algorithm is up to 2.7% for mixed speech from two speakers and 1.52% for noisy speech input.
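
The abstract reports recognition results as word error rate (WER). As a reminder of what that metric measures, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the scoring code used in the paper; Kaldi provides its own scoring tools, and their text normalization may differ from the plain whitespace split assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation handling).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word ("the") is missing from the hypothesis:
    # 1 deletion over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference. A difference between two systems, such as the improvements quoted in the abstract, is a difference between two such ratios averaged over the test set.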

Author information

Authors and Affiliations

Department of Electronics Engineering, Aligarh Muslim University, Aligarh, India
Yash Vardhan Varshney, Prashant Upadhyaya, Zia Ahmad Abbasi, Musiur Raza Abidi & Omar Farooq

Corresponding author

Correspondence to Yash Vardhan Varshney.

About this article

Cite this article

Varshney, Y.V., Upadhyaya, P., Abbasi, Z.A. et al. Improvement in monaural speech separation using sparse non-negative tucker decomposition. Int J Speech Technol 21, 837–849 (2018). https://doi.org/10.1007/s10772-018-9550-5

Received: 01 February 2018
Accepted: 20 August 2018
Published: 05 September 2018
Issue Date: 15 December 2018
DOI: https://doi.org/10.1007/s10772-018-9550-5
