To improve the noise robustness of automatic speech recognition (ASR), generative adversarial network (GAN) based enhancement methods are employed as front-end processing. These methods rely on a single adversarial process between an enhancement model and a discriminator. In this single adversarial process, the discriminator is encouraged to find differences between enhanced and clean speech, but the distribution of clean speech is ignored. In this paper, we propose a double adversarial network (DAN) that adds a second adversarial process, the adversarial generation process (AGP), which forces the discriminator not only to find these differences but also to model the distribution of clean speech. Furthermore, a functional mean square error (f-MSE) is proposed to utilize the representations learned by the discriminator. Experimental results show that AGP and f-MSE, both absent from previous GAN-based methods, are crucial for enhancement performance on the ASR task. Specifically, our DAN achieves a 13.00% relative word error rate improvement over noisy speech on the CHiME-2 test set, significantly outperforming several recent GAN-based enhancement methods.
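The headline result is a relative, not absolute, word error rate (WER) improvement, and the two are easy to confuse. The sketch below shows the usual way such a figure is computed, assuming the standard definition (baseline WER minus system WER, divided by baseline WER). The function name `relative_wer_improvement` and the WER values are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def relative_wer_improvement(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of the system over the baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical WERs in percent (illustrative only, not results from the paper).
noisy_wer = 20.0      # ASR on unprocessed noisy speech
enhanced_wer = 17.4   # ASR on speech passed through an enhancement front end

improvement = relative_wer_improvement(noisy_wer, enhanced_wer)
print(f"{improvement:.2f}% relative WER improvement")
```

With these made-up values, an absolute drop of 2.6 WER points corresponds to a 13% relative improvement, which is why a relative figure should always be read together with the baseline it is measured against.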
Cite as: Du, Z., Han, J., Zhang, X. (2020) Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition. Proc. Interspeech 2020, 309-313, doi: 10.21437/Interspeech.2020-1504
@inproceedings{du20_interspeech,
  author={Zhihao Du and Jiqing Han and Xueliang Zhang},
  title={{Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={309--313},
  doi={10.21437/Interspeech.2020-1504}
}