Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional VAD systems require strong frame-level supervision for training, which limits their performance in real-world test scenarios. The general-purpose VAD (GPVAD) framework was previously proposed to significantly enhance noise robustness. However, GPVAD models are comparatively large and only work for offline evaluation. This work proposes a knowledge distillation framework in which a (large, offline) teacher model provides frame-level supervision to a (light, online) student model. Our experiments verify that the proposed lightweight student models outperform GPVAD on all test sets, including clean, synthetic and real-world scenarios. Our smallest student model uses only 2.2% of the parameters and 15.9% of the inference time of our teacher model when evaluated on a Raspberry Pi.
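The teacher-to-student supervision described above can be illustrated with a minimal sketch. It assumes the teacher emits a per-frame speech probability and that the student is trained with binary cross-entropy against those soft labels. The abstract does not state the loss used in the paper, so the loss choice, the function name `frame_level_bce`, and the example probabilities are all hypothetical.

```python
import math


def frame_level_bce(student_probs, teacher_probs, eps=1e-7):
    """Mean binary cross-entropy of student frame probabilities against
    teacher frame probabilities (used here as soft labels).

    Assumption: BCE is an illustrative choice, not necessarily the
    loss used by the authors.
    """
    if not student_probs or len(student_probs) != len(teacher_probs):
        raise ValueError("student and teacher must cover the same non-empty frames")
    total = 0.0
    for s, t in zip(student_probs, teacher_probs):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(student_probs)


# Hypothetical per-frame speech probabilities for a five-frame clip.
teacher = [0.05, 0.10, 0.90, 0.95, 0.20]
good_student = [0.10, 0.10, 0.85, 0.90, 0.25]  # agrees with the teacher
poor_student = [0.90, 0.80, 0.10, 0.20, 0.70]  # disagrees with the teacher

loss_good = frame_level_bce(good_student, teacher)
loss_poor = frame_level_bce(poor_student, teacher)
print(loss_good < loss_poor)
```

A student whose frame predictions track the teacher's receives the lower loss, which is the signal a distillation setup of this kind would minimize during training.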
Cite as: Xu, X., Dinkel, H., Wu, M., Yu, K. (2021) A Lightweight Framework for Online Voice Activity Detection in the Wild. Proc. Interspeech 2021, 371-375, doi: 10.21437/Interspeech.2021-1977
@inproceedings{xu21b_interspeech,
  author={Xuenan Xu and Heinrich Dinkel and Mengyue Wu and Kai Yu},
  title={{A Lightweight Framework for Online Voice Activity Detection in the Wild}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={371--375},
  doi={10.21437/Interspeech.2021-1977}
}