ABSTRACT
The ability of modern text-to-speech methods to produce convincing synthetic voices makes it increasingly difficult to discern real from fake audio. In the present work, several synthesis pipelines were tested, and the best in terms of inference time and audio quality was selected to generate fake counterparts to the real audio of the TIMIT corpus, yielding a new fake audio detection dataset based on TIMIT. A range of audio representations (magnitude spectrogram and energy representations) was evaluated on both datasets, with the two-dimensional convolutional neural networks trained only on the Fake or Real (FoR) dataset. Although no single representation performed best on both datasets, the Mel spectrogram and Mel energy representations proved the most robust overall. No difference in recognition accuracy was evident during validation, but the two-dimensional convolutional neural network model tended to under-perform on both the FoR test set and the TIMIT-based synthetic set, regardless of the representation used, a finding corroborated by the data distribution analysis presented in this work.
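The Mel spectrogram and Mel energy representations compared in the abstract follow a standard recipe: a windowed STFT produces a power spectrogram, which is projected onto a bank of triangular Mel-scale filters, optionally followed by log compression before being fed to a 2D CNN. The sketch below illustrates that pipeline using only numpy; the parameter values (`n_fft`, `hop`, `n_mels`) are illustrative assumptions and not necessarily those used in the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising edge
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=128, n_mels=40):
    # Power spectrogram via a Hann-windowed STFT.
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Mel energies: project the power spectrogram onto the Mel filterbank.
    mel_energies = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Log compression yields the Mel spectrogram commonly fed to 2D CNNs.
    return np.log(mel_energies + 1e-10).T      # shape: (n_mels, n_frames)

# Example: one second of a 440 Hz tone at 16 kHz (TIMIT's sample rate).
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
S = mel_spectrogram(tone)
print(S.shape)
```

The resulting `(n_mels, n_frames)` array is treated as a single-channel image by the two-dimensional convolutional network; using the linear Mel energies instead of their logarithm gives the "Mel energies" variant mentioned above.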
Index Terms
- On the Generalizability of Two-dimensional Convolutional Neural Networks for Fake Speech Detection