Enhancing Visual Wake Word Spotting with Pretrained Model and Feature Balance Scaling

Abstract:

Wake word spotting research mainly focuses on the audio modality or on audio-visual multimodal exploration. The visual modality delivers stable results under poor acoustic conditions, making visual wake word spotting an emerging and challenging task. However, challenges remain in visual wake word spotting, such as overfitting due to subject dependence and performance degradation caused by the imbalance between positive and negative samples. This paper introduces an efficient and robust visual wake word spotting system. Notably, a pretrained visual lipreading sequence encoder is employed to extract more effective lip movement features, allowing the model to focus on lip movement patterns and preventing overfitting. Additionally, we propose feature balance scaling, which adjusts the feature value ranges of positive and negative samples during training. This scaling method can be easily applied to wake word spotting, mitigating the impact of data imbalance and enhancing the classifier's ability to handle out-of-bound samples. Finally, our system achieved notable results in the final evaluation of Task 1 of the ChatCLR challenge, attaining a false alarm rate of 0.110 and a false reject rate of 0.070, yielding a final score of 0.180 and securing third place.
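
The abstract does not give the exact formulation of feature balance scaling, so the sketch below is only one plausible interpretation: it assumes a hypothetical per-class scale factor (alpha_pos, alpha_neg) applied to pooled lip-movement features before the wake-word classifier, so that the imbalanced negative class does not dominate the feature value range. The class name FeatureBalanceScaling and both scale parameters are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn


class FeatureBalanceScaling(nn.Module):
    """Hypothetical sketch of feature balance scaling (not the paper's exact method).

    Assumes pooled sequence features from a pretrained lipreading encoder and
    binary labels (1 = wake word, 0 = background); each sample's feature vector
    is rescaled by a class-dependent factor during training.
    """

    def __init__(self, alpha_pos: float = 1.0, alpha_neg: float = 0.5):
        super().__init__()
        self.alpha_pos = alpha_pos  # assumed scale for positive (wake-word) samples
        self.alpha_neg = alpha_neg  # assumed scale for negative (background) samples

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # features: (batch, dim) pooled lip-movement embeddings
        # labels:   (batch,)    1 for wake word, 0 for background
        scale = torch.where(
            labels.bool(),
            torch.full_like(labels, self.alpha_pos, dtype=features.dtype),
            torch.full_like(labels, self.alpha_neg, dtype=features.dtype),
        )
        return features * scale.unsqueeze(-1)


if __name__ == "__main__":
    feats = torch.randn(4, 256)          # pooled features from the lipreading encoder
    labels = torch.tensor([1, 0, 0, 1])  # imbalanced positive/negative mini-batch
    scaled = FeatureBalanceScaling()(feats, labels)
    print(scaled.shape)  # torch.Size([4, 256])
```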
Date of Conference: 15-19 July 2024
Date Added to IEEE Xplore: 29 August 2024
Publisher: IEEE
Conference Location: Niagara Falls, ON, Canada
