Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection

Kim, Ui-Hyun

doi:10.21437/Interspeech.2021-43

Recent audio-visual voice activity detectors based on supervised learning require large amounts of labeled training data with manual mouth-region cropping in videos, and their performance is sensitive to mismatch between training and testing noise conditions. This paper introduces contrastive self-supervised learning for audio-visual voice activity detection as a possible solution to these problems. In addition, a novel self-supervised learning framework is proposed to improve overall training efficiency and testing performance on noise-corrupted datasets, as encountered in real-world scenarios. The framework includes a branched audio encoder and a noise-tolerant loss function to cope with the uncertainty of separating speech and noise features in a self-supervised manner. Experimental results, particularly under mismatched noise conditions, demonstrate improved performance over both a self-supervised learning baseline and a supervised learning framework.

Cite as: Kim, U.-H. (2021) Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection. Proc. Interspeech 2021, 326-330, doi: 10.21437/Interspeech.2021-43

@inproceedings{kim21b_interspeech,
  author={Ui-Hyun Kim},
  title={{Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={326--330},
  doi={10.21437/Interspeech.2021-43}
}