research-article

An End-to-End Audio Transformer with Multi-student Knowledge Distillation algorithm for Deepfake Speech Detection

Authors:

Huaiyu Li

ICCPR '24: Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition

Pages 366 - 371

https://doi.org/10.1145/3704323.3704334

Published: 07 January 2025

Abstract

The growing prevalence of fraudulent techniques has exposed the limitations of existing Spoofed Speech Detection (SSD) algorithms in both performance and detection speed. To address these challenges, this paper proposes a more stable and faster algorithm. First, a novel feature extraction algorithm is introduced, which combines an end-to-end extraction frontend with a feature smoothing mechanism to extract more robust feature representations. Second, a one-teacher, multi-student knowledge distillation system is proposed, with an Audio Transformer serving as the teacher model. The system comprises two distinct kinds of network: the teacher network and the student networks. With this one-teacher, multi-student distillation structure, the model achieves faster detection without compromising performance, meeting the requirements of real-time processing. Finally, the algorithm uses the ASVspoof2021 LA dataset to simulate unknown attacks and trains the student models on pseudo labels generated by the teacher model, improving the system's ability to handle increasingly varied unknown attacks in the future. Experimental results show that, on the ASVspoof2019 evaluation set, the proposed algorithm reaches optimal performance with the fewest model parameters, only 0.33M. Moreover, on the ASVspoof2021 LA and ASVspoof2021 DF evaluation sets, the proposed algorithm achieves performance close to state-of-the-art (SOTA) algorithms while requiring only 7.64 ms for single-utterance inference on a CPU, fulfilling the real-time processing criteria.
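The abstract does not give the loss function or the training procedure, so the following is only a minimal sketch of how a one-teacher, multi-student setup with teacher-generated pseudo labels could look. It assumes the standard temperature-softened distillation loss of Hinton et al. and a two-class (bona fide vs. spoof) output. The function names, the temperature value, and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-target loss: T^2 * KL(teacher || student) at temperature T (assumed form)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def pseudo_label(teacher_logits):
    """Hard pseudo label for an unlabeled utterance: the teacher's argmax class."""
    return max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])


def multi_student_losses(teacher_logits, students_logits, temperature=2.0):
    """One distillation loss per student, all computed against the same teacher output."""
    return [
        distillation_loss(teacher_logits, s, temperature) for s in students_logits
    ]


# Hypothetical logits for one utterance: index 0 = bona fide, index 1 = spoof.
teacher = [2.0, -1.0]
students = [[0.0, 0.0], [2.0, -1.0]]
losses = multi_student_losses(teacher, students)
label = pseudo_label(teacher)
```

In this sketch every student is compared against the same teacher output, so a student whose logits already match the teacher's incurs zero loss. How the paper weights the soft-target and pseudo-label terms, and how it combines the students at inference time, is not stated in the abstract.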

Published In

ICCPR '24: Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition

October 2024

448 pages

ISBN:9798400717482

DOI:10.1145/3704323

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICCPR 2024

ICCPR 2024: 2024 13th International Conference on Computing and Pattern Recognition

October 25 - 27, 2024

Tianjin, China
