ISCA Archive Interspeech 2021

Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model

Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

When voice activity detection (VAD) is applied to noisy audio, noise reduction (speech separation) and VAD are generally performed as separate steps. In this case, the noise reduction may suppress the speech itself, and the VAD may then perform poorly on the noise-reduced signal. This study proposes a VAD model built by connecting a neural network-based speech/noise separation model and a VAD model in tandem. By training the two models simultaneously, the separation model is expected to learn to take the VAD results into account, and thus to achieve effective noise separation. Moreover, the improved speech/noise separation model will improve the accuracy of the VAD model. In this research, we deal with real live speech from baseball games, which has a very poor signal-to-noise ratio. The VAD experiments showed that connecting the speech/noise separation model and the VAD model in tandem improved the frame-level F1-score by 4.2 points.
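
For readers unfamiliar with the metric, the following is a minimal sketch of how a frame-level F1-score for VAD could be computed. It assumes binary per-frame labels (1 for speech, 0 for non-speech); the function name `frame_f1` and the label encoding are illustrative assumptions, and the scoring protocol actually used in the paper is not stated in this abstract.

```python
def frame_f1(reference, predicted):
    """Frame-level F1 for binary VAD labels (assumed: 1 = speech, 0 = non-speech)."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same number of frames")

    # Count frames where the prediction agrees or disagrees with the reference.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = [1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    print(f"frame-level F1 = {frame_f1(reference, predicted):.3f}")
```

An improvement reported in "points" is read here as a difference on the percentage scale, that is, a change of 0.042 in the value this function returns.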


doi: 10.21437/Interspeech.2021-792

Cite as: Nonaka, Y., Leow, C.S., Kobayashi, A., Utsuro, T., Nishizaki, H. (2021) Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model. Proc. Interspeech 2021, 351-355, doi: 10.21437/Interspeech.2021-792

@inproceedings{nonaka21_interspeech,
  author={Yuto Nonaka and Chee Siang Leow and Akio Kobayashi and Takehito Utsuro and Hiromitsu Nishizaki},
  title={{Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={351--355},
  doi={10.21437/Interspeech.2021-792}
}