DOI: 10.1145/3512732.3533585
research-article

On the Generalizability of Two-dimensional Convolutional Neural Networks for Fake Speech Detection

Published: 27 June 2022

Abstract

The powerful capability of modern text-to-speech methods to produce synthetic, computer-generated voices can pose a problem in discerning real from fake audio. In the present work, different text-to-speech pipelines were tested, and the best in terms of inference time and audio quality was selected to expand on the real audio of the TIMIT dataset. This led to the creation of a new fake-audio-detection dataset based on the TIMIT corpus. A range of audio representations (magnitude spectrogram and energy representations) was studied in terms of performance on both datasets, with the two-dimensional convolutional neural networks trained only on the Fake or Real (FoR) dataset. While no single representation performed best on both datasets, the Mel spectrogram and Mel energies representations were found to be the most robust overall. No difference in recognition accuracy was evident during validation, while the two-dimensional convolutional neural network model tended to under-perform on the test set of the FoR dataset and on the synthesized TIMIT-based dataset, regardless of the representation used. This finding was corroborated by the data distribution analysis presented in the present work.
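The abstract names the Mel spectrogram as one of the two-dimensional representations fed to the convolutional networks. As an illustrative sketch only (not the paper's pipeline; the frame size, hop, Mel-band count, and sample rate below are assumed values, not the authors' settings), a log-Mel spectrogram can be computed from raw audio with NumPy alone:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula, the convention used by most audio toolkits
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers spaced evenly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(x, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, window each frame, take the magnitude STFT,
    # then project onto the Mel filterbank and log-compress
    window = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * window
              for s in range(0, len(x) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))   # (frames, n_fft//2+1)
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T       # (frames, n_mels)
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone as a stand-in for a speech clip
t = np.arange(16000) / 16000.0
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # a (time, Mel-band) image suitable as 2D CNN input
```

The resulting two-dimensional array is what makes image-style 2D convolutions applicable to audio: one axis is time, the other is Mel frequency.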

Supplementary Material

MP4 File (MAD22-fp37.mp4)
Presentation video for the study on generalization of 2D CNNs for fake speech detection.



    Published In

    MAD '22: Proceedings of the 1st International Workshop on Multimedia AI against Disinformation
    June 2022
    93 pages
    ISBN:9781450392426
    DOI:10.1145/3512732


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. convolutional neural networks
    2. fake speech detection
    3. magnitude spectrogram representations
    4. neural vocoders


    Conference

    ICMR '22


    Article Metrics

    • Downloads (last 12 months): 44
    • Downloads (last 6 weeks): 2
    Reflects downloads up to 17 Feb 2025


    Cited By

    • (2024) "Video and Audio Deepfake Datasets and Open Issues in Deepfake Technology: Being Ahead of the Curve". Forensic Sciences, 4(3):289-377. DOI: 10.3390/forensicsci4030021. Online publication date: 13-Jul-2024.
    • (2024) "An Intelligent Method for Recognizing Real and Generated Ukrainian-Language Voices". 2024 IEEE 5th KhPI Week on Advanced Technology (KhPIWeek), pages 1-6. DOI: 10.1109/KhPIWeek61434.2024.10877988. Online publication date: 7-Oct-2024.
    • (2024) "Audio Transformer for Synthetic Speech Detection via Formant Magnitude and Phase Analysis". ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4805-4809. DOI: 10.1109/ICASSP48485.2024.10445932. Online publication date: 14-Apr-2024.
    • (2024) "Generation and Detection of Manipulated Multimodal Audiovisual Content". Information Fusion, 103:C. DOI: 10.1016/j.inffus.2023.102103. Online publication date: 4-Mar-2024.
    • (2023) "Voice Cloning and Forgery Detection Using WaveGAN and SpecGAN". 2023 7th International Conference on Computing, Communication, Control and Automation (ICCUBEA), pages 1-6. DOI: 10.1109/ICCUBEA58933.2023.10392082. Online publication date: 18-Aug-2023.
    • (2023) "A Survey on the Detection and Impacts of Deepfakes in Visual, Audio, and Textual Formats". IEEE Access, 11:144497-144529. DOI: 10.1109/ACCESS.2023.3344653. Online publication date: 2023.
    • (2022) "Open Challenges in Synthetic Speech Detection". 2022 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1-6. DOI: 10.1109/WIFS55849.2022.9975433. Online publication date: 12-Dec-2022.
