Abstract:
Voice phishing (vishing) is increasingly popular due to advances in speech synthesis technology. In particular, deep learning can generate an arbitrary-content audio clip simulating a victim's voice that is difficult for humans and automatic speaker verification (ASV) systems alike to distinguish from genuine speech. Countermeasure (CM) systems have recently been developed to help ASV combat synthetic speech. In this work, we propose BTS-E, a framework that evaluates the correlation between Breathing, Talking (speech), and Silence sounds in an audio clip and uses this information for deepfake detection. We argue that natural human sounds, such as breathing, are hard for text-to-speech (TTS) systems to synthesize. We conducted a large-scale evaluation on the ASVspoof 2019 and 2021 evaluation sets to validate our hypothesis. The experimental results show the applicability of breathing-sound features in detecting deepfake voices; overall, the proposed system improves classifier performance by up to 46%.
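The abstract does not describe BTS-E's implementation, but frameworks of this kind first need to separate an audio clip into silent and non-silent regions before breathing or speech events can be examined. Below is a minimal, illustrative sketch of such a segmentation step using short-time energy; the function name, frame sizes, and the -40 dB threshold are assumptions for illustration, not the paper's actual method:

```python
import numpy as np

def segment_silence(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Label each analysis frame of `signal` as silent (True) or
    non-silent (False) by short-time RMS energy relative to the
    clip's peak amplitude. Illustrative only."""
    frame = int(sr * frame_ms / 1000)   # frame length in samples
    hop = int(sr * hop_ms / 1000)       # hop length in samples
    peak = np.max(np.abs(signal)) + 1e-12
    labels = []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = signal[start:start + frame]
        rms = np.sqrt(np.mean(chunk ** 2))
        db = 20.0 * np.log10(rms / peak + 1e-12)  # energy relative to peak
        labels.append(db < threshold_db)
    return np.array(labels)
```

In practice, the non-silent frames would then be further classified (e.g. breathing vs. speech), and the statistics of those segments fed to the downstream deepfake classifier; those later stages are specific to BTS-E and not sketched here.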
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023