Abstract
Speech signals are commonly degraded by room reverberation and additive noise in real environments. This paper focuses on separating target speech signals from multichannel input signals in reverberant conditions. To overcome the drawbacks of existing methods, this work proposes an efficient multichannel speech signal separation technique based on a new hybrid method that combines grasshopper optimization-based matrix factorization (GOMF) with an enthalpy-based deep neural network (EDNN). To predict and remove unwanted noise in the multichannel input signal, this paper presents a classification framework comprising the following steps: STFT, GOMF-based rank estimation, signal eigenvalue identification, noise removal, feature extraction, and classification. First, the STFT is used to map the multichannel mixture waveforms to complex spectrograms. Then, GOMF is used to estimate the underlying speech signals and the noise. After this estimation, salient features are extracted; feature extraction is based on spatial, spectral, and directional features. To attain enhanced classification outcomes, the spectrogram is reconstructed by the enthalpy-based deep neural network (EDNN). Finally, the resulting speech spectrogram is converted back to the separated output signal by the inverse STFT. Experimental results show that the proposed approach achieves the highest SNR of 24.0523 at an input SNR of −6 dB, compared with 18.50032 for DNN-JAT; the RNN and NMF-DNN baselines perform worst, with SNRs of 13.45434 and 12.29991, respectively. The proposed method is compared with various algorithms and existing works, and achieves higher outcomes than the other existing works.
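The separation pipeline described above (STFT, speech/noise estimation, spectrogram reconstruction, inverse STFT) can be sketched in Python. This is a minimal illustration only: the GOMF and EDNN stages are not specified in enough detail here to implement, so they are replaced by a simple energy-based soft-mask placeholder; the function name `separate_sketch` and all parameter values are assumptions, not the paper's actual method.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_sketch(mixture, fs=16000, nperseg=512):
    """Sketch of the abstract's pipeline on a (channels, samples) mixture."""
    # 1) STFT: map each channel's waveform to a complex spectrogram.
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg, axis=-1)
    # spec has shape (channels, freq_bins, frames)

    # 2) Placeholder for GOMF-based speech/noise estimation: a soft mask
    #    built from a per-frequency noise-floor estimate on the reference
    #    channel stands in for the grasshopper-optimized factorization.
    mag = np.abs(spec[0])
    noise_floor = np.percentile(mag, 20, axis=1, keepdims=True)
    mask = mag**2 / (mag**2 + noise_floor**2 + 1e-12)

    # 3) Placeholder for EDNN spectrogram reconstruction: apply the mask
    #    to the reference channel's complex spectrogram.
    est_spec = mask * spec[0]

    # 4) Inverse STFT: convert the enhanced spectrogram back to a waveform.
    _, est = istft(est_spec, fs=fs, nperseg=nperseg)
    return est

# Usage on a synthetic 2-channel, 1-second mixture:
rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 16000))
out = separate_sketch(mix)
```

The placeholder mask only illustrates where the paper's learned components would plug in; any real reproduction would substitute the GOMF rank estimation and the trained EDNN at steps 2 and 3.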
References
Wang, D., Chen, J.: Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1702–1726 (2018)
Ding, Y., Xu, Y., Zhang, S., Cong, Y., Wang, L.: Self-supervised learning for audio-visual speaker diarization. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Barcelona, Spain, pp. 4367–4371 (2020)
Bahmaninezhad, F., Wu, J., Gu, R., Zhang, S.-X., Xu, Y., Yu, M., Yu, D.: A comprehensive study of speech separation: spectrogram vs waveform separation. arXiv preprint arXiv:1905.07497 (2019)
Majumder, N., Hazarika, D., Gelbukh, A., Cambria, E., Poria, S.: Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 161, 124–133 (2018)
Chakrabarty, S., Habets, E.A.: Multi-speaker doa estimation using deep convolutional networks trained with noise signals. IEEE J. Select. Topics Signal Process. 13(1), 8–21 (2019)
Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
Luo, Y., Mesgarani, N.: Conv-tasnet: surpassing ideal time frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Luo, Y., Han, C., Mesgarani, N., Ceolini, E., Liu, S.C.: FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 260–267 (2019)
Luo, Yi., Chen, Z., Mesgarani, N.: Speaker-independent speech separation with deep attractor network. IEEE/ACM Trans. Audio Speech Lang. Process. 26(4), 787–796 (2018)
Fan, C., Tao, J., Liu, B., Yi, J., Wen, Z.: Gated recurrent fusion of spatial and spectral features for multi-channel speech separation with deep embedding representations. In: Proc. Interspeech (2020)
Shimada, K., Bando, Y., Mimura, M., Itoyama, K., Yoshii, K., Kawahara, T.: Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 27(5), 960–971 (2019)
Li, X., Girin, L., Gannot, S., Horaud, R.: Multichannel speech separation and enhancement using the convolutive transfer function. IEEE/ACM Trans. Audio Speech Lang. Process. 27(3), 645–659 (2019)
Gu, R., Zhang, S.-X., Chen, L., Xu, Y., Yu, M., Su, D., Zou, Y., Yu, D.: Enhancing end-to-end multi-channel speech separation via spatial feature learning. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 7319–7323 (2020)
Luo, Y., Chen, Z., Mesgarani, N., Yoshioka, T.: End-to-end microphone permutation and number invariant multi-channel speech separation. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 6394–6398 (2020)
Gu, R., Shi-Xiong Z., Yong X., Lianwu C., Yuexian Z., Dong Y.: Multi-modal multi-channel target speech separation. IEEE J. Select. Topics Signal Process. (2020).
Seki, S., Kameoka, H., Li, Li., Toda, T., Takeda, K.: Underdetermined source separation based on generalized multichannel variationalautoencoder. IEEE Access 7, 168104–168115 (2019)
Yan, C., Gong, B., Wei, Y., Gao, Y.: Deep multi-view enhancement hashing for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Yan, C., Shao, B., Zhao, H., Ning, R., Zhang, Y., Xu, F.: 3D room layout estimation from a single RGB image. IEEE Trans. Multimed. (2020)
Yan, C., Li, Z., Zhang, Y., Liu, Y., Ji, X., Zhang, Y.: Depth image denoising using nuclear norm and learning graph model. arXiv preprint arXiv:2008.03741 (2020)
Du, Y., Sekiguchi, K.: Semi-supervised multichannel speech separation based on a phone- and speaker-aware deep generative model of speech spectrograms. National Institute of Advanced Industrial Science and Technology (2020)
Ozerov, A., Fevotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18(3), 550–563 (2010)
Saito, Y., Takamichi, S., Saruwatari, H.: Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. Comput. Speech Lang. 1(58), 347–363 (2019)
Mitsufuji, Y., Uhlich, S., Takamune, N., Kitamura, D., Koyama, S., Saruwatari, H.: Multichannel non-negative matrix factorization using banded spatial covariance matrices in wavenumber domain. IEEE Trans. Audio Speech Lang. Process. 28, 49–60 (2020)
Wang, Z.-Q., Wang, P., Wang, D.: Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR. IEEE Trans. Audio Speech Lang. Process. 28, 1778–1787 (2020)
Masuyama, Y., Komatsu, T.: Consistency-aware multi-channel speech enhancement using deep neural networks. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 821–825 (2020)
Martinez, A.M., Gerlach, L., Payá-Vayá, G., Hermansky, H., Ooster, J., Meyer, B.T.: DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters. Speech Commun. 1(106), 44–56 (2019)
Ravi Kishore, T., Sai Sidharth, D.: Analysis of linear and non-linear frequency modulated signals using STFT and Hough transform. In: Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (2015)
Ewees, A.A., Abd Elaziz, M., Houssein, E.H.: Improved grasshopper optimization algorithm using opposition-based learning. Expert Syst. Appl. 112, 156–172 (2018)
Zhang, X., Wang, D.L.: Deep learning based binaural speech separation in reverberant environments. IEEE Trans. Audio Speech Lang. Process. 25(5), 1075–1084 (2017)
Gu, R., Chen, L., Zhang, S.-X., Zheng, J., Xu, Y., Yu, M., Su, D., Zou, Y., Yu, D.: Neural spatial filter: target speaker speech separation assisted with directional information. In: Proc. Interspeech, pp. 4290–4294 (2019)
Lozano-Diez, A., Zazo, R., Toledano, D.T., Gonzalez-Rodriguez, J.: An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition. PLoS ONE 12(8), e0182580 (2017)
Narayanan, A., Wang, D.: Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE Trans. Audio Speech Lang. Process. 23(1), 92–101 (2015)
Geiger, J.T., Weninger, F., Gemmeke, J., Wöllmer, M., Schuller, B., Rigoll, G.: Memory-enhanced neural networks and NMF for robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 22(6), 1037–1046 (2014)
Weng, C., Yu, D., Watanabe, S., Juang, B.-H.: Recurrent deep neural networks for robust speech recognition. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5532–5536 (2014)
Weninger, F., Geiger, J., Wöllmer, M., Schuller, B., Rigoll, G.: The Munich feature enhancement approach to the 2nd CHiME challenge using BLSTM recurrent neural networks. In: Proc. 2nd CHiME Workshop on Machine Listening in Multisource Environments, pp. 86–90 (2013)
Jin, Y., Tang, C., Liu, Q., Wang, Y.: Multi-head self-attention based deep clustering for single-channel speech separation. IEEE Access (2020)
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Y. Zhang.
About this article
Cite this article
Koteswararao, Y.V., Rao, C.B.R. Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks. Multimedia Systems 27, 271–286 (2021). https://doi.org/10.1007/s00530-020-00740-y