DOI: 10.1145/3664647.3681345

Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier

Published: 28 October 2024

Abstract

Speech produced by generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), is frequently indistinguishable from genuine recordings, making it difficult for listeners to tell real content from synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voices poses significant privacy and security risks. In audio deepfake detection, most models that achieve the highest detection accuracy currently rely on self-supervised pre-trained models. However, as deepfake audio generation algorithms continue to evolve, maintaining high discrimination accuracy against new algorithms becomes increasingly challenging. To enhance sensitivity to deepfake audio features, we propose a deepfake audio detection model that incorporates a Sensitive Layer Selection (SLS) module. Specifically, the pre-trained XLS-R backbone allows our model to extract diverse audio features from its different layers, each providing distinct discriminative information. The SLS classifier then captures sensitive contextual information across these layer-level features and exploits it for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with an Equal Error Rate (EER) of 1.92% on ASVspoof 2021 DF and 7.46% on In-the-Wild. Code and data are available at https://github.com/QiShanZhang/SLSforADD.
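The pipeline described above (hidden states taken from every layer of a pre-trained XLS-R model, fused by a layer-selective classifier, and scored with EER) can be sketched roughly as follows. This is a minimal illustrative sketch under stated assumptions, not the authors' released SLS implementation: the `facebook/wav2vec2-xls-r-300m` checkpoint name, the softmax layer weighting, the temporal mean pooling, and the small MLP head are all placeholder choices for illustration; see the linked GitHub repository for the actual code.

```python
# Illustrative sketch only: multi-layer XLS-R features plus a learned layer-weighting
# head for bona fide / spoof classification, and the standard pooled-EER metric.
# The checkpoint, softmax layer weighting, mean pooling, and MLP head are assumptions
# made for this sketch, not the paper's exact SLS module.
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import roc_curve
from transformers import Wav2Vec2Model


class LayerSelectiveDetector(nn.Module):
    def __init__(self, backbone_name: str = "facebook/wav2vec2-xls-r-300m"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone_name)
        n_states = self.backbone.config.num_hidden_layers + 1    # transformer layers + CNN output
        dim = self.backbone.config.hidden_size
        self.layer_logits = nn.Parameter(torch.zeros(n_states))  # learned per-layer importance
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono audio
        out = self.backbone(waveform, output_hidden_states=True)
        states = torch.stack(out.hidden_states)                  # (n_states, batch, time, dim)
        w = torch.softmax(self.layer_logits, dim=0)               # soft "layer selection"
        fused = (w[:, None, None, None] * states).sum(dim=0)      # weighted sum over layers
        return self.head(fused.mean(dim=1))                       # temporal mean pool -> logits


def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER given detection scores (higher = more bona fide) and labels (1 = bona fide)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2.0)
```

In ASVspoof-style evaluation, detection scores for all trials are pooled and the EER is the operating point at which the false acceptance and false rejection rates coincide; the 1.92% and 7.46% figures quoted in the abstract are EERs of this kind on ASVspoof 2021 DF and In-the-Wild, respectively.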


Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024, 11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. AIGC
  2. anti-spoofing
  3. audio deepfake detection
  4. countermeasures
  5. text-to-speech
  6. voice conversion

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of Hubei Province of China
  • Training program of high level scientific research achievements of Hubei Minzu University

Conference

MM '24
Sponsor:
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

