Abstract
To address the insufficient feature extraction and limited enhancement performance of traditional neural networks under low signal-to-noise ratio (SNR) conditions, this paper proposes a speech enhancement model, the adaptive mean/median-empirical mode decomposition-multilayer gated feature fusion convolutional recurrent network (ME-MGFCRN), built from empirical mode decomposition (EMD), a temporal convolutional network (TCN), and a gated convolutional recurrent network (GCRN) combined with a feature fusion module (FFM). The model adopts a split-frequency learning strategy: the TCN branch learns low-frequency features, the MGFCRN branch learns high-frequency features, and the FFM fuses the two feature sets to perform speech enhancement by feature mapping. Ablation and comparison experiments on the dataset evaluate enhancement quality with the PESQ, FwSegSNR, and STOI metrics. The results show that, across different noise environments and SNR conditions, the proposed model outperforms the baseline models; in particular, at a low SNR of −5 dB, it improves FwSegSNR by more than 0.86 dB and PESQ by more than 0.02 over the baselines.
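The abstract specifies the architecture only at the block-diagram level. Purely as an illustration, the following PyTorch sketch wires the named components together under assumed shapes: the TCN block, the gated convolutional recurrent branch, the fusion rule, and all channel counts and layer depths are hypothetical stand-ins, not the authors' ME-MGFCRN implementation (the adaptive mean/median filtering and EMD preprocessing stages are omitted).

```python
# Minimal structural sketch of the split-frequency pipeline described in the
# abstract. All module shapes, channel counts, and the fusion rule are
# illustrative assumptions, not the paper's actual ME-MGFCRN design.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One dilated 1-D convolution block, standing in for the TCN branch
    that models low-frequency features."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))  # residual connection

class GCRNBranch(nn.Module):
    """Stand-in for the gated convolutional recurrent branch (MGFCRN)
    that models high-frequency features: gated convolution, then a GRU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3, padding=1)
        self.rnn = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x):               # x: (batch, channels, frames)
        a, b = self.conv(x).chunk(2, dim=1)
        g = a * torch.sigmoid(b)        # gated linear unit
        y, _ = self.rnn(g.transpose(1, 2))
        return y.transpose(1, 2)

class FeatureFusionModule(nn.Module):
    """Hypothetical FFM: concatenate both feature sets, project back."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, low, high):
        return self.proj(torch.cat([low, high], dim=1))

class SplitFrequencyEnhancer(nn.Module):
    """Low band -> TCN, high band -> gated recurrent branch, FFM fuses,
    and a 1x1 mapping layer predicts the enhanced features."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.tcn = nn.Sequential(*[TCNBlock(channels, 2 ** i) for i in range(4)])
        self.gcrn = GCRNBranch(channels)
        self.ffm = FeatureFusionModule(channels)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, low_band, high_band):  # each (batch, channels, frames)
        return self.out(self.ffm(self.tcn(low_band), self.gcrn(high_band)))

# Smoke test with random "low/high-frequency feature" tensors.
model = SplitFrequencyEnhancer()
low, high = torch.randn(2, 64, 100), torch.randn(2, 64, 100)
print(model(low, high).shape)  # torch.Size([2, 64, 100])
```

The point the sketch mirrors is the split-frequency routing described in the abstract: each branch sees only its own frequency band, and only the FFM combines the two feature sets before the final mapping layer produces the enhanced-feature output.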






Availability of Data and Materials
All data included in this study are available upon request from the corresponding author.
Acknowledgements
This research was supported by the Natural Science Foundation of Heilongjiang Province (No. LH2020F033), the National Natural Science Youth Foundation of China (No. 11804068), and a research project of the Heilongjiang Province Health Commission (No. 20221111001069).
Ethics declarations
Conflict of interest
The authors declare that they have no competing or conflicting interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lan, C., Chen, H., Zhang, L. et al. Research on Speech Enhancement Algorithm by Fusing Improved EMD and GCRN Networks. Circuits Syst Signal Process 43, 4588–4604 (2024). https://doi.org/10.1007/s00034-024-02677-3