
Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Speech signals are commonly degraded by room reverberation and additive noise in real environments. This paper focuses on extracting target speech signals from multichannel input signals recorded in reverberant conditions. To overcome the drawbacks of existing approaches, this work proposes an efficient multichannel speech separation technique based on a new hybrid method that combines grasshopper-optimization-based matrix factorization (GOMF) with an enthalpy-based deep neural network (EDNN). To predict and remove the unwanted noise in the multichannel input signal, the proposed framework proceeds through the following steps: STFT, GOMF-based rank estimation, identification of signal eigenvalues, noise removal, feature extraction, and classification. First, the STFT maps the multichannel mixture waveforms to complex spectrograms. Then, GOMF is used to estimate the dominant speech components and the noise. After this estimation, the relevant features are extracted, namely spatial, spectral, and directional features. To obtain improved classification results, the spectrogram is reconstructed with the enthalpy-based deep neural network (EDNN). Finally, the resulting speech spectrogram is converted back to the separated output signal with the inverse STFT. Experimental results show that the proposed approach achieves the highest SNR at −6 dB input, 24.0523, compared with 18.50032 for DNN-JAT; the RNN and NMF-DNN baselines yield the lowest SNRs of 13.45434 and 12.29991, respectively. The proposed method is compared with several algorithms and existing works and achieves higher scores than the other approaches.
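The GOMF rank estimator and the enthalpy-based DNN are specific to this work and are not specified in enough detail in the abstract to reproduce, so the Python sketch below only illustrates the general shape of the pipeline described above: a multichannel STFT front end, an eigenvalue-based estimate of the per-frequency signal-subspace rank standing in for GOMF, a placeholder soft mask standing in for the EDNN spectrogram reconstruction, and an inverse STFT back end. The sampling rate, the window length, and all function names here are illustrative assumptions, not values from the paper.

    # Sketch of the multichannel STFT/ISTFT pipeline outlined in the abstract.
    # GOMF and the EDNN are not public; simple stand-ins are used and labelled.
    import numpy as np
    from scipy.signal import stft, istft

    FS = 16000        # assumed sampling rate
    NPERSEG = 512     # assumed STFT window length

    def multichannel_stft(x):
        """x: (channels, samples) -> complex spectrograms (channels, freqs, frames)."""
        _, _, Z = stft(x, fs=FS, nperseg=NPERSEG, axis=-1)
        return Z

    def estimate_rank(Z, energy=0.95):
        """Rough per-frequency rank estimate from the eigenvalues of the spatial
        covariance matrix (a stand-in for GOMF-based rank estimation)."""
        C, F, T = Z.shape
        ranks = np.empty(F, dtype=int)
        for f in range(F):
            X = Z[:, f, :]                      # (channels, frames)
            R = (X @ X.conj().T) / T            # spatial covariance (C x C)
            w = np.linalg.eigvalsh(R)[::-1]     # eigenvalues, descending
            cum = np.cumsum(w) / (w.sum() + 1e-12)
            ranks[f] = int(np.searchsorted(cum, energy) + 1)
        return ranks

    def toy_mask(Z_ref, noise_floor=1e-3):
        """Placeholder soft mask on the reference channel; the paper instead
        reconstructs the spectrogram with the EDNN."""
        mag = np.abs(Z_ref)
        return mag / (mag + noise_floor * mag.max())

    def separate(x):
        Z = multichannel_stft(x)                # (channels, freqs, frames)
        ranks = estimate_rank(Z)                # per-bin signal-subspace size
        mask = toy_mask(Z[0])                   # mask from reference channel
        _, y = istft(Z[0] * mask, fs=FS, nperseg=NPERSEG)
        return y, ranks

    if __name__ == "__main__":
        mixture = np.random.randn(4, FS * 2)    # 4-channel, 2-second dummy mixture
        enhanced, ranks = separate(mixture)
        print(enhanced.shape, ranks[:8])

The eigenvalue step mirrors the "identify signal eigenvalues" stage in the abstract: the number of dominant eigenvalues of the per-bin spatial covariance matrix gives a rough count of active sources at that frequency.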





Author information


Corresponding author

Correspondence to Yannam Vasantha Koteswararao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Y. Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Koteswararao, Y.V., Rao, C.B.R. Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks. Multimedia Systems 27, 271–286 (2021). https://doi.org/10.1007/s00530-020-00740-y

