Abstract
In this paper, we propose a new autoencoder network architecture with a clustering mechanism for underdetermined blind speech source separation, i.e., the case in which there are fewer mixtures than sources. The autoencoder network projects the mixtures into an embedding space to obtain their embedding vectors. The network model additionally incorporates a clustering mechanism and a nearest neighbor clustering algorithm to estimate the cluster centers of the embedding vectors. Then, based on the embedding vectors, a hard assignment method and a probability assignment method are proposed to assign the mixtures to their corresponding clusters and recover the sources. Experimental results demonstrate that the proposed method outperforms the baseline algorithms.
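The difference between the two assignment methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the probabilities are computed, so the softmax over negative squared distances, the `temperature` parameter, the function names, and the toy embeddings and cluster centers below are all assumptions made for illustration. The cluster centers are taken as given here, whereas in the paper they are estimated by the network and the nearest neighbor clustering algorithm.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hard_assignment(embeddings, centers):
    """For each embedding, return the index of its nearest cluster center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(e, centers[k]))
        for e in embeddings
    ]


def probability_assignment(embeddings, centers, temperature=1.0):
    """For each embedding, return weights over centers that sum to 1.

    Assumed form: softmax of negative squared distances. The paper's
    actual probability model may differ.
    """
    weights = []
    for e in embeddings:
        scores = [-squared_distance(e, c) / temperature for c in centers]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([x / total for x in exps])
    return weights


def apply_masks(mixture_bins, weights):
    """Split each mixture value among the sources according to its weights."""
    n_sources = len(weights[0])
    return [
        [w[k] * x for x, w in zip(mixture_bins, weights)]
        for k in range(n_sources)
    ]


# Hypothetical toy data: three cluster centers (one per source) and four
# embedding vectors, each belonging to one element of a single mixture.
centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
embeddings = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.8], [0.6, 0.5]]
mixture_bins = [1.0, 2.0, 3.0, 4.0]

labels = hard_assignment(embeddings, centers)
one_hot = [
    [1.0 if k == label else 0.0 for k in range(len(centers))]
    for label in labels
]
soft = probability_assignment(embeddings, centers)

hard_sources = apply_masks(mixture_bins, one_hot)
soft_sources = apply_masks(mixture_bins, soft)
```

With hard assignment, each mixture element goes entirely to one source. With the assumed probability assignment, an ambiguous element such as the last one is shared among the sources in proportion to its weights, and the source estimates still sum back to the mixture.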
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61866024).
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Niu, M., Zhang, Y. Underdetermined blind speech source separation based on deep nearest neighbor clustering algorithm. Multimed Tools Appl 82, 1171–1183 (2023). https://doi.org/10.1007/s11042-022-13009-5
DOI: https://doi.org/10.1007/s11042-022-13009-5