DOI: 10.1145/3543507.3583222
research-article
Artifacts Available / v1.1

Transferring Audio Deepfake Detection Capability across Languages

Published: 30 April 2023

ABSTRACT

The proliferation of deepfake content has motivated a surge of detection studies. However, existing audio deepfake detection methods work exclusively on English, and data resources in other languages are lacking. Cross-lingual deepfake detection is a critical but rarely explored area that calls for more study. This paper conducts the first comprehensive study of deepfake detection from a cross-lingual perspective. We observe that English data, which covers a rich set of deepfake algorithms, can teach a detector about various spoofing artifacts, and that this knowledge helps detection across language domains. Based on this observation, we first construct a first-of-its-kind cross-lingual evaluation dataset containing heterogeneous spoofed speech uttered in the two most widely spoken languages. We then explore domain adaptation (DA) techniques to transfer the artifact detection capability, and propose effective and practical DA strategies suited to the cross-lingual scenario. Our adversarial-based DA paradigm teaches the model to learn real/fake knowledge while shedding language dependency. Extensive experiments on 137 hours of audio clips validate that the adapted models can detect fake audio generated by unseen algorithms in the new domain.
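
The abstract does not say how the adversarial-based DA paradigm is implemented. One common way to learn a task while discarding domain information is gradient reversal, and the sketch below illustrates that general idea only; it is not the authors' method. It assumes a hypothetical setup with a shared feature vector, a linear real/fake head (`w_spoof`), and a linear language head (`w_lang`); all names and the weight `lam` are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad_logit(logit, label):
    """Gradient of binary cross-entropy w.r.t. the logit."""
    return sigmoid(logit) - label


def adversarial_feature_grad(feature, is_fake, is_target_lang,
                             w_spoof, w_lang, lam=1.0):
    """Gradient that reaches the shared feature extractor for one sample.

    The real/fake head's gradient passes through unchanged. The language
    head's gradient is multiplied by -lam (gradient reversal), so a descent
    step on the features makes the language HARDER to predict.
    """
    spoof_logit = sum(w * f for w, f in zip(w_spoof, feature))
    lang_logit = sum(w * f for w, f in zip(w_lang, feature))
    g_spoof = bce_grad_logit(spoof_logit, is_fake)
    g_lang = bce_grad_logit(lang_logit, is_target_lang)
    return [g_spoof * ws - lam * g_lang * wl
            for ws, wl in zip(w_spoof, w_lang)]


# Hypothetical toy sample: a fake clip (label 1) in the source language (label 0).
grad = adversarial_feature_grad(
    feature=[0.0, 0.0], is_fake=1, is_target_lang=0,
    w_spoof=[1.0, 0.0], w_lang=[0.0, 1.0], lam=1.0,
)
print(grad)
```

With `lam=0.0` the language head has no influence on the features; increasing `lam` trades real/fake accuracy against language invariance.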


          • Published in

            WWW '23: Proceedings of the ACM Web Conference 2023
            April 2023
            4293 pages
            ISBN: 9781450394161
            DOI: 10.1145/3543507

            Copyright © 2023 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
