Abstract
To address the insufficient feature extraction and limited enhancement performance of traditional neural networks under low signal-to-noise ratio (SNR) conditions, this paper proposes a speech enhancement model, the adaptive mean/median-empirical mode decomposition-multilayer gated feature fusion convolutional recurrent network (ME-MGFCRN), built from empirical mode decomposition (EMD), a temporal convolutional network (TCN), and a gated convolutional recurrent network (GCRN) combined with a feature fusion module (FFM). The model adopts a split-frequency learning strategy: the TCN branch learns low-frequency features, the MGFCRN branch learns high-frequency features, and the FFM fuses the two feature sets to perform speech enhancement by feature mapping. Ablation and comparison experiments on the dataset evaluate enhancement quality with the PESQ, FwSegSNR, and STOI metrics. The results show that, across different noise environments and SNR conditions, the proposed model outperforms the baseline models; in particular, at a low SNR of −5 dB, it improves FwSegSNR by more than 0.86 dB and PESQ by more than 0.02 over the baselines.
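The abstract specifies the architecture only at the block-diagram level. Purely as an illustration, the following PyTorch sketch wires the named components together under assumed shapes: the TCN block, the gated convolutional recurrent branch, the fusion rule, and all channel counts and layer depths are hypothetical stand-ins, not the authors' ME-MGFCRN implementation (the adaptive mean/median filtering and EMD preprocessing stages are omitted).

```python
# Minimal structural sketch of the split-frequency pipeline described in the
# abstract. All module shapes, channel counts, and the fusion rule are
# illustrative assumptions, not the paper's actual ME-MGFCRN design.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated 1-D convolution block, standing in for the TCN branch
    that models low-frequency features."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))  # residual connection

class GCRNBranch(nn.Module):
    """Stand-in for the gated convolutional recurrent branch (MGFCRN)
    that models high-frequency features: gated convolution, then a GRU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3, padding=1)
        self.rnn = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x):               # x: (batch, channels, frames)
        a, b = self.conv(x).chunk(2, dim=1)
        g = a * torch.sigmoid(b)        # gated linear unit
        y, _ = self.rnn(g.transpose(1, 2))
        return y.transpose(1, 2)

class FeatureFusionModule(nn.Module):
    """Hypothetical FFM: concatenate both feature sets, project back."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, low, high):
        return self.proj(torch.cat([low, high], dim=1))

class SplitFrequencyEnhancer(nn.Module):
    """Low band -> TCN, high band -> gated recurrent branch, FFM fuses,
    and a 1x1 mapping layer predicts the enhanced features."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.tcn = nn.Sequential(*[TCNBlock(channels, 2 ** i) for i in range(4)])
        self.gcrn = GCRNBranch(channels)
        self.ffm = FeatureFusionModule(channels)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, low_band, high_band):  # each (batch, channels, frames)
        return self.out(self.ffm(self.tcn(low_band), self.gcrn(high_band)))

# Smoke test with random "low/high-frequency feature" tensors.
model = SplitFrequencyEnhancer()
low, high = torch.randn(2, 64, 100), torch.randn(2, 64, 100)
print(model(low, high).shape)  # torch.Size([2, 64, 100])
```

The point the sketch mirrors is the split-frequency routing described in the abstract: each branch sees only its own frequency band, and only the FFM combines the two feature sets before the final mapping layer produces the enhanced-feature output.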






Availability of Data and Materials
All data included in this study are available upon request from the corresponding author.
Acknowledgements
This research was supported by the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Youth Foundation of China (No. 11804068), and a research project of the Heilongjiang Province Health Commission (No. 20221111001069).
Ethics declarations
Conflict of interest
The authors declare that they have no competing or conflicting interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lan, C., Chen, H., Zhang, L. et al. Research on Speech Enhancement Algorithm by Fusing Improved EMD and GCRN Networks. Circuits Syst Signal Process 43, 4588–4604 (2024). https://doi.org/10.1007/s00034-024-02677-3