Abstract
Speech signals are commonly degraded by room reverberation and additive noise in real environments. This paper focuses on separating target speech signals from multichannel input signals in reverberant conditions. To overcome the drawbacks of existing methods, this work proposes an efficient multichannel speech signal separation technique based on a new hybrid method that combines grasshopper optimization-based matrix factorization (GOMF) with an enthalpy-based deep neural network (EDNN). To predict and remove unwanted noise in the multichannel input signal, this paper presents a classification framework comprising the following steps: STFT, GOMF-based rank estimation, signal eigenvalue identification, noise removal, feature extraction, and classification. First, the STFT is used to map the multichannel mixture waveforms to complex spectrograms. Then, GOMF is used to estimate the underlying speech signals and the noise. After this estimation, salient features are extracted; feature extraction is based on spatial, spectral, and directional features. To attain enhanced classification outcomes, the spectrogram is reconstructed by the enthalpy-based deep neural network (EDNN). Finally, the resulting speech spectrogram is converted back to the separated output signal by the inverse STFT. Experimental results show that the proposed approach achieves the highest SNR of 24.0523 at an input SNR of −6 dB, compared with 18.50032 for DNN-JAT; the RNN and NMF-DNN baselines perform worst, with SNRs of 13.45434 and 12.29991, respectively. The proposed method is compared with various algorithms and existing works, and achieves higher outcomes than the other existing works.
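The separation pipeline described above (STFT, speech/noise estimation, spectrogram reconstruction, inverse STFT) can be sketched in Python. This is a minimal illustration only: the GOMF and EDNN stages are not specified in enough detail here to implement, so they are replaced by a simple energy-based soft-mask placeholder; the function name `separate_sketch` and all parameter values are assumptions, not the paper's actual method.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_sketch(mixture, fs=16000, nperseg=512):
    """Sketch of the abstract's pipeline on a (channels, samples) mixture."""
    # 1) STFT: map each channel's waveform to a complex spectrogram.
    _, _, spec = stft(mixture, fs=fs, nperseg=nperseg, axis=-1)
    # spec has shape (channels, freq_bins, frames)

    # 2) Placeholder for GOMF-based speech/noise estimation: a soft mask
    #    built from a per-frequency noise-floor estimate on the reference
    #    channel stands in for the grasshopper-optimized factorization.
    mag = np.abs(spec[0])
    noise_floor = np.percentile(mag, 20, axis=1, keepdims=True)
    mask = mag**2 / (mag**2 + noise_floor**2 + 1e-12)

    # 3) Placeholder for EDNN spectrogram reconstruction: apply the mask
    #    to the reference channel's complex spectrogram.
    est_spec = mask * spec[0]

    # 4) Inverse STFT: convert the enhanced spectrogram back to a waveform.
    _, est = istft(est_spec, fs=fs, nperseg=nperseg)
    return est

# Usage on a synthetic 2-channel, 1-second mixture:
rng = np.random.default_rng(0)
mix = rng.standard_normal((2, 16000))
out = separate_sketch(mix)
```

The placeholder mask only illustrates where the paper's learned components would plug in; any real reproduction would substitute the GOMF rank estimation and the trained EDNN at steps 2 and 3.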
References
Wang, D., Chen, J.: Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1702–1726 (2018)
Ding, Y., Xu, Y., Zhang, S., Cong, Y., Wang, L.: Self-supervised learning for audio-visual speaker diarization. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Barcelona, Spain, pp. 4367–4371 (2020)
Bahmaninezhad, F., Wu, J., Gu, R., Zhang, S.-X., Xu, Y., Yu, M., Yu, D.: A comprehensive study of speech separation: spectrogram vs waveform separation. arXiv preprint arXiv:1905.07497 (2019)
Majumder, N., Hazarika, D., Gelbukh, A., Cambria, E., Poria, S.: Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 161, 124–133 (2018)
Chakrabarty, S., Habets, E.A.: Multi-speaker doa estimation using deep convolutional networks trained with noise signals. IEEE J. Select. Topics Signal Process. 13(1), 8–21 (2019)
Afouras, T., Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
Luo, Y., Mesgarani, N.: Conv-tasnet: surpassing ideal time frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
Luo, Y., Han, C., Mesgarani, N., Ceolini, E., Liu, S.C.: FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. In: Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 260–267 (2019)
Luo, Yi., Chen, Z., Mesgarani, N.: Speaker-independent speech separation with deep attractor network. IEEE/ACM Trans. Audio Speech Lang. Process. 26(4), 787–796 (2018)
Fan, C., Tao, J., Liu, B., Yi, J., Wen, Z.: Gated recurrent fusion of spatial and spectral features for multi-channel speech separation with deep embedding representations. In: Proc. Interspeech (2020)
Shimada, K., Bando, Y., Mimura, M., Itoyama, K., Yoshii, K., Kawahara, T.: Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 27(5), 960–971 (2019)
Li, X., Girin, L., Gannot, S., Horaud, R.: Multichannel speech separation and enhancement using the convolutive transfer function. IEEE/ACM Trans. Audio Speech Lang. Process. 27(3), 645–659 (2019)
Gu, R., Zhang, S.-X., Chen, L., Xu, Y., Yu, M., Su, D., Zou, Y., Yu, D.: Enhancing end-to-end multi-channel speech separation via spatial feature learning. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 7319–7323 (2020)
Luo, Y., Chen, Z., Mesgarani, N., Yoshioka, T.: End-to-end microphone permutation and number invariant multi-channel speech separation. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 6394–6398 (2020)
Gu, R., Shi-Xiong Z., Yong X., Lianwu C., Yuexian Z., Dong Y.: Multi-modal multi-channel target speech separation. IEEE J. Select. Topics Signal Process. (2020).
Seki, S., Kameoka, H., Li, Li., Toda, T., Takeda, K.: Underdetermined source separation based on generalized multichannel variationalautoencoder. IEEE Access 7, 168104–168115 (2019)
Yan, C., Gong, B., Wei, Y., Gao, Y.: Deep multi-view enhancement hashing for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Yan, C., Shao, B., Zhao, H., Ning, R., Zhang, Y., Xu, F.: 3D room layout estimation from a single RGB image. IEEE Trans. Multimed. (2020)
Yan, C., Li, Z., Zhang, Y., Liu, Y., Ji, X., Zhang, Y.: Depth image denoising using nuclear norm and learning graph model. arXiv preprint arXiv:2008.03741 (2020)
Du, Y., Sekiguchi, K.: Semi-supervised multichannel speech separation based on a phone- and speaker-aware deep generative model of speech spectrograms. National Institute of Advanced Industrial Science and Technology (2020)
Ozerov, A., Fevotte, C.: Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18(3), 550–563 (2010)
Saito, Y., Takamichi, S., Saruwatari, H.: Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. Comput. Speech Lang. 1(58), 347–363 (2019)
Mitsufuji, Y., Uhlich, S., Takamune, N., Kitamura, D., Koyama, S., Saruwatari, H.: Multichannel non-negative matrix factorization using banded spatial covariance matrices in wavenumber domain. IEEE Trans. Audio Speech Lang. Process. 28, 49–60 (2020)
Wang, Z.-Q., Wang, P., Wang, D.: Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR. IEEE Trans. Audio Speech Lang. Process. 28, 1778–1787 (2020)
Masuyama, Y., Komatsu, T.: Consistency-aware multi-channel speech enhancement using deep neural networks. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 821–825 (2020)
Martinez, A.M., Gerlach, L., Payá-Vayá, G., Hermansky, H., Ooster, J., Meyer, B.T.: DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters. Speech Commun. 1(106), 44–56 (2019)
Ravi Kishore, T., Sai Sidharth, D.: Analysis of linear and non-linear frequency modulated signals using STFT and Hough transform. In: Proc. IEEE Int. Symp. Signal Process. Inf. Technol. (2015)
Ewees, A.A., Abd Elaziz, M., Houssein, E.H.: Improved grasshopper optimization algorithm using opposition-based learning. Expert Syst. Appl. 112, 156–172 (2018)
Zhang, X., Wang, D.L.: Deep learning based binaural speech separation in reverberant environments. IEEE Trans. Audio Speech Lang. Process. 25(5), 1075–1084 (2017)
Gu, R., Chen, L., Zhang, S.-X., Zheng, J., Xu, Y., Yu, M., Su, D., Zou, Y., Yu, D.: Neural spatial filter: target speaker speech separation assisted with directional information. In: Proc. Interspeech, pp. 4290–4294 (2019)
Lozano-Diez, A., Zazo, R., Toledano, D.T., Gonzalez-Rodriguez, J.: An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition. PLoS ONE 12(8), e0182580 (2017)
Narayanan, A., Wang, D.: Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training. IEEE Trans. Audio Speech Lang. Process. 23(1), 92–101 (2015)
Geiger, J.T., Weninger, F., Gemmeke, J., Wöllmer, M., Schuller, B., Rigoll, G.: Memory-enhanced neural networks and NMF for robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 22(6), 1037–1046 (2014)
Weng, C., Yu, D., Watanabe, S., Juang, B.-H.: Recurrent deep neural networks for robust speech recognition. In: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 5532–5536 (2014)
Weninger, F., Geiger, J., Wöllmer, M., Schuller, B., Rigoll, G.: The Munich feature enhancement approach to the 2nd CHiME challenge using BLSTM recurrent neural networks. In: Proc. 2nd CHiME Workshop on Machine Listening in Multisource Environments, pp. 86–90 (2013)
Jin, Y., Tang, C., Liu, Q., Wang, Y.: Multi-head self-attention based deep clustering for single-channel speech separation. IEEE Access (2020)
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Y. Zhang.
About this article
Cite this article
Koteswararao, Y.V., Rao, C.B.R. Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks. Multimedia Systems 27, 271–286 (2021). https://doi.org/10.1007/s00530-020-00740-y