ABSTRACT
To address the problem that popular encoder-decoder-based monaural speech enhancement models do not make full use of full-scale features, a full-scale-connected speech enhancement model, FSC-SENet, is proposed. First, a speech enhancement model is constructed on the CRN architecture: a convolutional encoder and decoder extract features and recover the speech signal, while LSTM modules at the bottleneck of the model capture temporal dependencies. Then, a full-scale connection method and a multi-feature dynamic fusion mechanism are proposed, so that the decoder can exploit features at every scale when recovering clean speech. Experimental results on the TIMIT corpus show that, compared with CRN, FSC-SENet improves the PESQ score by 0.39 and the STOI score by 2.8% on seen noise, and the PESQ score by 0.43 and the STOI score by 3.1% on unseen noise, demonstrating that the proposed full-scale connection and dynamic feature fusion mechanism give CRN better speech enhancement performance.
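The core idea of a full-scale connection with dynamic fusion, as popularized by UNet 3+, is that each decoder stage receives feature maps from *every* encoder scale, resampled to a common resolution and combined with learned weights. The following NumPy sketch is purely illustrative and is not the paper's implementation: the function names (`resize_time`, `full_scale_fuse`), the nearest-neighbor resampling, and the softmax-weighted sum standing in for the dynamic fusion mechanism are all assumptions made for the sake of a minimal, runnable example.

```python
import numpy as np

def resize_time(feat, target_len):
    # Nearest-neighbor resampling of a (channels, frames) feature
    # map along the time axis, so maps from different encoder
    # scales can be aligned to one common resolution.
    idx = (np.arange(target_len) * feat.shape[-1] / target_len).astype(int)
    return feat[..., idx]

def full_scale_fuse(encoder_feats, scale, weights):
    # Bring every encoder feature map to the resolution of the
    # target decoder scale, then combine them with softmax-normalized
    # "dynamic" weights (a stand-in for the learned fusion mechanism).
    target_len = encoder_feats[scale].shape[-1]
    resized = [resize_time(f, target_len) for f in encoder_feats]
    w = np.exp(weights) / np.exp(weights).sum()
    return sum(wi * fi for wi, fi in zip(w, resized))

# Toy feature maps at three scales (channels x frames).
feats = [np.random.randn(4, 100), np.random.randn(4, 50), np.random.randn(4, 25)]
fused = full_scale_fuse(feats, scale=1, weights=np.zeros(3))
print(fused.shape)  # (4, 50)
```

In a real network the resampling would be strided convolution or transposed convolution and the fusion weights would be predicted per frame, but the data flow — gather all scales, align, weight, sum — is the same.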
REFERENCES
- Wang Y, Wang D. Towards scaling up classification-based speech separation. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(7): 1381-1390.
- Xu Y, Du J, Dai L-R, et al. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 23(1): 7-19.
- Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. MICCAI 2015: 234-241.
- Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
- Jansson A, Humphrey E, Montecchio N, et al. Singing voice separation with deep U-Net convolutional networks. 18th International Society for Music Information Retrieval Conference (ISMIR), 2017: 23-27.
- Stoller D, Ewert S, Dixon S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. ISMIR 2018: 334-340.
- Soni M H, Shah N, Patil H A. Time-frequency masking-based speech enhancement using generative adversarial network. ICASSP 2018: 5039-5043.
- Park S R, Lee J W. A fully convolutional neural network for speech enhancement. Interspeech 2017: 1993-1997.
- Tan K, Wang D. A convolutional recurrent neural network for real-time speech enhancement. Interspeech 2018: 3229-3233.
- Li A, Zheng C, Fan C, et al. A recursive network with dynamic attention for monaural speech enhancement. Interspeech 2020: 2422-2426.
- Huang H, Lin L, Tong R, et al. UNet 3+: A full-scale connected UNet for medical image segmentation. ICASSP 2020: 1055-1059.
- Garofolo J S, Lamel L F, Fisher W M, et al. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1, 1993.
- Hu G, Wang D. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, 2010, 18(8): 2067-2079.
- Varga A, Steeneken H J M. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 1993, 12(3): 247-251.
- Rix A W, Beerends J G, Hollier M P, et al. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. ICASSP 2001: 749-752.
- Taal C H, Hendriks R C, Heusdens R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2125-2136.