Abstract:
Humans interpret speech through both acoustic and textual signals. Recognizing the importance of incorporating contextual information into speech processing, especially for spoofing detection, we introduce a multimodal representation framework that integrates text with audio signals to improve spoof detection. This research makes two main contributions: first, a new multimodal spoof detection model called Raw-BERT; second, a multi-headed multimodal attention network that fuses text and audio representations for stronger detection. We evaluate the proposed framework extensively across datasets in multiple languages, showing that textual context significantly boosts model performance. Overall, the proposed model consistently outperforms existing spoof detection algorithms across evaluations on benchmark audio spoofing datasets.
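The abstract does not specify the internals of the multi-headed multimodal attention network, but the general idea of fusing two modalities with multi-head attention can be sketched as cross-attention in which audio frames act as queries over text tokens as keys and values. The sketch below is purely illustrative (random weights, NumPy only); the function name, dimensions, and head count are assumptions, not the authors' actual Raw-BERT architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, text, num_heads, rng):
    """Illustrative multi-head cross-attention: audio frames (queries)
    attend over text tokens (keys/values). Weights are random here;
    a real model would learn them."""
    d = audio.shape[-1]
    assert d % num_heads == 0
    dh = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    # Project and split into heads: (num_heads, seq_len, dh)
    q = (audio @ Wq).reshape(audio.shape[0], num_heads, dh).transpose(1, 0, 2)
    k = (text @ Wk).reshape(text.shape[0], num_heads, dh).transpose(1, 0, 2)
    v = (text @ Wv).reshape(text.shape[0], num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    # Merge heads back into a single fused representation per audio frame
    fused = (scores @ v).transpose(1, 0, 2).reshape(audio.shape[0], d)
    return fused @ Wo

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 64))  # 50 audio frames, 64-dim features
text = rng.standard_normal((12, 64))   # 12 text tokens, 64-dim embeddings
fused = cross_modal_attention(audio, text, num_heads=4, rng=rng)
print(fused.shape)  # (50, 64): one text-informed vector per audio frame
```

The fused output keeps the audio sequence length while injecting textual context into each frame, which is one common way such a representation could then feed a downstream spoof/bona-fide classifier.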
Date of Conference: 15-18 September 2024
Date Added to IEEE Xplore: 11 November 2024