ISCA Archive Interspeech 2016

Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-Target Learning for Noisy Speech Recognition

Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Denoising autoencoders (DAEs) have been investigated for enhancing noisy speech before feeding it to the back-end deep neural network (DNN) acoustic model. However, there may be a mismatch between the DAE output and the expected input of the back-end DNN, as well as an inconsistency between the training objective functions of the two networks. In this paper, a joint optimization method for the front-end DAE and the back-end DNN is proposed based on a multi-target learning scheme. In the first step, the front-end DAE is trained with an additional target of minimizing the errors propagated by the back-end DNN. Then, the unified network of DAE and DNN is fine-tuned for the phone state classification target, with an extra target of input speech enhancement imposed on the DAE part. The proposed method has been evaluated on the CHiME3 ASR task and shown to improve on the baseline DNN as well as on the simple coupling of DAE with DNN. The method is also effective as a post-filter for a beamformer.
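
The multi-target idea in the abstract, an enhancement target combined with a phone state classification target, can be illustrated with a small sketch. This is not the authors' implementation: the linear stand-ins for the DAE and DNN, the toy dimensions, the interpolation weight `alpha`, and all function names are assumptions made for illustration, since the abstract does not specify how the two targets are weighted or which layers are used.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def multi_target_loss(noisy, clean, label, W_dae, W_dnn, alpha=0.5):
    """Weighted sum of an enhancement loss and a classification loss.

    alpha is a hypothetical interpolation weight (not from the paper):
    alpha near 1 emphasizes enhancement, alpha near 0 emphasizes
    phone state classification.
    """
    # Front-end "DAE": a single linear layer standing in for the real network.
    enhanced = matvec(W_dae, noisy)
    # Back-end "DNN": a linear layer plus softmax standing in for the real network.
    posteriors = softmax(matvec(W_dnn, enhanced))
    # Enhancement target: mean squared error against the clean features.
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    # Classification target: cross-entropy for the reference state label.
    ce = -math.log(posteriors[label])
    return alpha * mse + (1.0 - alpha) * ce, mse, ce

# Toy frame: 3 feature dimensions, 2 phone states (illustrative values only).
noisy = [0.9, -0.2, 0.4]
clean = [1.0, 0.0, 0.5]
W_dae = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_dnn = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]

total, mse, ce = multi_target_loss(noisy, clean, 0, W_dae, W_dnn, alpha=0.5)
print(total, mse, ce)
```

Under these assumptions, the first training step would correspond to a loss dominated by the enhancement term with the classification term as the extra target, and the fine-tuning step to the reverse. The sketch only computes the combined loss; which parameters are updated in each step is not shown.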


doi: 10.21437/Interspeech.2016-388

Cite as: Mimura, M., Sakai, S., Kawahara, T. (2016) Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-Target Learning for Noisy Speech Recognition. Proc. Interspeech 2016, 3803-3807, doi: 10.21437/Interspeech.2016-388

@inproceedings{mimura16_interspeech,
  author={Masato Mimura and Shinsuke Sakai and Tatsuya Kawahara},
  title={{Joint Optimization of Denoising Autoencoder and DNN Acoustic Model Based on Multi-Target Learning for Noisy Speech Recognition}},
  year=2016,
  booktitle={Proc. Interspeech 2016},
  pages={3803--3807},
  doi={10.21437/Interspeech.2016-388}
}