A Lightweight Framework for Online Voice Activity Detection in the Wild

Xu, Xuenan; Dinkel, Heinrich; Wu, Mengyue; Yu, Kai

doi:10.21437/Interspeech.2021-1977

Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional VAD systems require strong frame-level supervision for training, which limits their performance in real-world test scenarios. The general-purpose VAD (GPVAD) framework was previously proposed to significantly enhance noise robustness. However, GPVAD models are comparatively large and only work for offline evaluation. This work proposes a knowledge distillation framework in which a (large, offline) teacher model provides frame-level supervision to a (light, online) student model. Our experiments verify that the proposed lightweight student models outperform GPVAD on all test sets, including clean, synthetic and real-world scenarios. When evaluated on a Raspberry Pi, our smallest student model uses only 2.2% of the parameters and 15.9% of the inference time of our teacher model.
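The distillation idea in the abstract, where a teacher supplies per-frame targets for a student, can be illustrated with a small sketch. The abstract does not state the loss function or whether the teacher outputs are used as soft or hard labels, so the binary cross-entropy on soft speech probabilities below is an assumption, and the function name frame_bce and all numbers are hypothetical.

```python
import math


def frame_bce(student_probs, teacher_probs, eps=1e-7):
    """Mean binary cross-entropy between per-frame speech probabilities.

    Assumption: the teacher's per-frame probabilities act as soft targets.
    The abstract only says the teacher provides frame-level supervision.
    """
    if not student_probs or len(student_probs) != len(teacher_probs):
        raise ValueError("inputs must be non-empty and cover the same frames")
    total = 0.0
    for s, t in zip(student_probs, teacher_probs):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
    return total / len(student_probs)


# Hypothetical teacher output for five frames (probability of speech).
teacher = [0.05, 0.10, 0.90, 0.95, 0.20]
# A student that tracks the teacher closely, and one that does not.
good_student = [0.10, 0.15, 0.85, 0.90, 0.25]
poor_student = [0.90, 0.80, 0.10, 0.20, 0.70]

print(frame_bce(good_student, teacher) < frame_bce(poor_student, teacher))
```

In a training loop, a loss of this kind would be minimized with respect to the student's parameters while the teacher stays fixed, so that frame-level labels come from the teacher's predictions rather than from manual annotation.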

Cite as: Xu, X., Dinkel, H., Wu, M., Yu, K. (2021) A Lightweight Framework for Online Voice Activity Detection in the Wild. Proc. Interspeech 2021, 371-375, doi: 10.21437/Interspeech.2021-1977

@inproceedings{xu21b_interspeech,
  author={Xuenan Xu and Heinrich Dinkel and Mengyue Wu and Kai Yu},
  title={{A Lightweight Framework for Online Voice Activity Detection in the Wild}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={371--375},
  doi={10.21437/Interspeech.2021-1977}
}