ISCA Archive Interspeech 2021

Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture

Masakiyo Fujimoto, Hisashi Kawai

This paper addresses noise-robust automatic speech recognition (ASR) under the constraints of real-time, one-pass, and single-channel processing. Under such strong constraints, single-channel speech enhancement becomes a key technology, because methods that require multiple passes or batch processing, such as acoustic model adaptation, are not suitable. However, single-channel speech enhancement often degrades ASR performance due to speech distortion. To overcome this problem, we propose a noise-robust acoustic modeling method based on a stream-wise transformer model. The proposed method accepts as input multi-stream features obtained by multiple single-channel speech enhancement methods, and it selectively uses the feature stream appropriate to the noise environment by attending to the noteworthy stream through multi-head attention. Because the proposed method computes attention along the stream direction instead of the time-series direction, it is capable of real-time, low-latency processing. Comparative evaluations reveal that the proposed method improves ASR accuracy in noisy environments and reduces the number of model parameters even under these strong constraints.
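
The idea of attending along the stream direction rather than the time direction can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes a single attention head, identity query/key/value projections, and plain Python lists as features, and the names `stream_attention` and `process_utterance` are hypothetical. It only shows that each frame is processed using the enhancement streams of that frame alone, so no future frames are needed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def stream_attention(frame_streams):
    """Scaled dot-product attention across streams for ONE frame.

    frame_streams: list of S feature vectors, one per enhancement stream
    (assumption: all vectors have the same dimension).
    Returns (outputs, weights): S attended vectors and the S x S weights.
    Simplification: single head, identity projections (no learned weights).
    """
    dim = len(frame_streams[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in frame_streams:
        # Score this stream against every stream of the same frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in frame_streams
        ]
        w = softmax(scores)
        # Weighted sum of the stream vectors (values = the streams themselves).
        attended = [
            sum(w[s] * frame_streams[s][d] for s in range(len(frame_streams)))
            for d in range(dim)
        ]
        outputs.append(attended)
        weights.append(w)
    return outputs, weights


def process_utterance(frames):
    """Process frames one at a time; each frame needs only its own streams."""
    for frame_streams in frames:
        yield stream_attention(frame_streams)
```

In this sketch the attention matrix has size S x S per frame (S being the number of enhancement streams), independent of utterance length, which is why attention over streams does not require buffering the time sequence.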


doi: 10.21437/Interspeech.2021-225

Cite as: Fujimoto, M., Kawai, H. (2021) Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture. Proc. Interspeech 2021, 281-285, doi: 10.21437/Interspeech.2021-225

@inproceedings{fujimoto21_interspeech,
  author={Masakiyo Fujimoto and Hisashi Kawai},
  title={{Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={281--285},
  doi={10.21437/Interspeech.2021-225}
}