ABSTRACT
The ability of modern text-to-speech methods to produce convincing synthetic voices makes it increasingly difficult to discern real from fake audio. In the present work, several synthesis pipelines were tested, and the best in terms of inference time and audio quality was selected to generate fake counterparts to the real audio of the TIMIT corpus, yielding a new fake audio detection dataset based on TIMIT. A range of audio representations (magnitude spectrogram and energy representations) was evaluated on both datasets, with the two-dimensional convolutional neural networks trained only on the Fake or Real (FoR) dataset. Although no single representation performed best on both datasets, the Mel spectrogram and Mel energy representations proved the most robust overall. No difference in recognition accuracy was evident during validation, but the two-dimensional convolutional neural network model tended to under-perform on both the FoR test set and the TIMIT-based synthetic set, regardless of the representation used, a finding corroborated by the data distribution analysis presented in this work.
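The Mel spectrogram and Mel energy representations compared in the abstract follow a standard recipe: a windowed STFT produces a power spectrogram, which is projected onto a bank of triangular Mel-scale filters, optionally followed by log compression before being fed to a 2D CNN. The sketch below illustrates that pipeline using only numpy; the parameter values (`n_fft`, `hop`, `n_mels`) are illustrative assumptions and not necessarily those used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising edge
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Power spectrogram via a Hann-windowed STFT.
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Mel energies: project the power spectrogram onto the Mel filterbank.
    mel_energies = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Log compression yields the Mel spectrogram commonly fed to 2D CNNs.
    return np.log(mel_energies + 1e-10).T      # shape: (n_mels, n_frames)

# Example: one second of a 440 Hz tone at 16 kHz (TIMIT's sample rate).
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = mel_spectrogram(tone)
print(S.shape)
```

The resulting `(n_mels, n_frames)` array is treated as a single-channel image by the two-dimensional convolutional network; using the linear Mel energies instead of their logarithm gives the "Mel energies" variant mentioned above.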
Index Terms
- On the Generalizability of Two-dimensional Convolutional Neural Networks for Fake Speech Detection