DOI: 10.1145/3704323.3704334
research-article

An End-to-End Audio Transformer with Multi-student Knowledge Distillation algorithm for Deepfake Speech Detection

Published: 07 January 2025

Abstract

The increasing prevalence of fraudulent techniques has exposed the limitations of existing Spoofed Speech Detection (SSD) algorithms in both performance and detection speed. To address these challenges, this paper proposes a more stable and faster algorithm. First, a novel feature extraction algorithm is introduced, which combines an end-to-end extraction frontend with a feature smoothing mechanism to extract more robust feature representations. Second, a one-teacher, multi-student knowledge distillation system is proposed, with an audio transformer serving as the teacher model. The system comprises two distinct kinds of network: the teacher network and the student networks. Through this one-teacher, multi-student structure, the model achieves faster detection without compromising performance, meeting the requirements of real-time processing. Finally, the algorithm uses the ASVspoof2021 LA dataset to simulate unknown attacks and trains the student models on pseudo labels generated by the teacher model, improving the system's ability to handle increasingly varied unknown attacks in the future. Experimental results show that, on the ASVspoof2019 evaluation set, the proposed algorithm reaches optimal performance with the fewest model parameters, only 0.33M. Moreover, on the ASVspoof2021 LA and ASVspoof2021 DF evaluation sets, the proposed algorithm achieves performance close to that of state-of-the-art (SOTA) algorithms while requiring only 7.64 ms to run inference on a single utterance on a CPU, fulfilling the real-time processing criteria.
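The abstract does not give the loss used to train the students, so the following is only a minimal sketch of how a one-teacher, multi-student distillation step could look. It assumes the widely used soft-target formulation of Hinton et al. (temperature-scaled softmax plus KL divergence), which may differ from the paper's actual objective. The function names, the temperature value, and the class order (index 0 = bona fide, index 1 = spoof) are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: KL between softened teacher and student outputs.

    The T^2 factor follows Hinton et al.; it is an assumption here,
    not something stated in the abstract.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


def pseudo_label(teacher_logits):
    """Hard pseudo label: index of the teacher's highest-scoring class."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def multi_student_losses(teacher_logits, students_logits, temperature=2.0):
    """One distillation loss per student, all guided by the same teacher."""
    return [
        distillation_loss(teacher_logits, s, temperature)
        for s in students_logits
    ]


if __name__ == "__main__":
    # Hypothetical logits for one unlabeled utterance: [bona fide, spoof].
    teacher = [0.4, 2.1]
    students = [[0.3, 1.8], [1.0, 0.9]]
    print("pseudo label:", pseudo_label(teacher))
    print("per-student losses:", multi_student_losses(teacher, students))
```

In this sketch each student is compared against the same teacher output, so a student that already agrees with the teacher receives a loss near zero, while one that disagrees receives a larger loss. The hard pseudo label is what would stand in for the missing ground truth on data treated as unknown attacks.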

Published In

ICCPR '24: Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition
October 2024
448 pages
ISBN: 9798400717482
DOI: 10.1145/3704323

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. ASVspoof2021
2. Spoofed speech detection
3. End-to-end
4. Knowledge distillation

Qualifiers

Research-article

Conference

ICCPR 2024
