DOI: 10.1145/3512732.3533585
Research article

On the Generalizability of Two-dimensional Convolutional Neural Networks for Fake Speech Detection

Published: 27 June 2022

ABSTRACT

The ability of modern text-to-speech methods to produce convincing synthetic voices poses a growing problem for discerning real from fake audio. In this work, several synthesis pipelines were tested, and the best in terms of inference time and audio quality was used to expand the real audio of the TIMIT corpus, yielding a new fake audio detection dataset based on TIMIT. A range of audio representations (magnitude spectrograms and energy representations) was evaluated on both datasets, with two-dimensional convolutional neural networks trained only on the Fake or Real (FoR) dataset. Although no single representation performed best on both datasets, the Mel spectrogram and Mel energies representations proved the most robust overall. Recognition accuracy showed no differences across representations during validation, but the two-dimensional convolutional neural network tended to under-perform on the FoR test set and on the TIMIT-based synthesized set regardless of the representation used, a finding corroborated by the data distribution analysis presented in this work.
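The abstract compares magnitude spectrogram and Mel-scale energy representations as CNN inputs. As a minimal illustration of the latter, the sketch below computes log-Mel energies from a waveform using only NumPy; the frame size, hop length, and filter count are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_energies(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the magnitude spectrum.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n_fft, axis=1))  # (frames, n_fft//2 + 1)
    fb = mel_filterbank(n_mels, n_fft, sr)
    mel = mag ** 2 @ fb.T                             # Mel-weighted power per band
    return np.log(mel + 1e-10)                        # (frames, n_mels)

# Example: one second of white noise at 16 kHz yields a (61, 40) feature map,
# the kind of two-dimensional input a 2D CNN classifier would consume.
x = np.random.default_rng(0).standard_normal(16000)
feats = log_mel_energies(x)
print(feats.shape)
```

In practice a library such as librosa would typically be used for this step; the point here is only that the Mel representation turns a 1D waveform into a 2D time-frequency image suitable for convolutional models.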

Supplemental Material

MAD22-fp37.mp4 (mp4, 22.8 MB)


Published in

MAD '22: Proceedings of the 1st International Workshop on Multimedia AI against Disinformation
June 2022, 93 pages
ISBN: 9781450392426
DOI: 10.1145/3512732
Copyright © 2022 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
