Fake Speech Recognition Using Deep Learning

Camacho, Steven; Ballesteros, Dora Maria; Renza, Diego

doi:10.1007/978-3-030-86702-7_4


Conference paper
First Online: 29 September 2021

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1431)

Abstract

The growing number of algorithms and commercial tools for creating synthetic audio has led to a high level of misinformation, especially on social media. As a consequence, recent efforts have focused on detecting this type of content. However, the task is far from solved, as fake audio sounds increasingly natural. In this paper we present a model that classifies audio recordings as natural or fake, using an audio preparation stage that includes raw audio transformation and a modelling stage based on a custom Convolutional Neural Network (CNN) architecture. Our model is trained on data from the FoR dataset, which contains natural and synthetic audio recordings obtained from several algorithms for deepfake content generation. The performance of the model is evaluated with metrics such as F1 score, precision (P) and recall (R). According to the results, the recordings are correctly classified in 88.9% of the cases.
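The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes binary labels with 1 = fake and 0 = natural and treats "fake" as the positive class. The function name `precision_recall_f1` and the sample labels are hypothetical; they are not taken from the paper and do not reproduce its results.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # P = TP / (TP + FP), R = TP / (TP + FN); fall back to 0.0 when undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for illustration only: 1 = fake, 0 = natural.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

With these made-up labels there are three true positives, one false positive and one false negative, so precision, recall and F1 all come out to 0.75.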

Acknowledgment

This work is supported by the “Universidad Militar Nueva Granada - Vicerrectoría de Investigaciones” under the grant IMP-ING-2936 of 2019.

Author information

Authors and Affiliations

Universidad Militar Nueva Granada, Bogotá, Colombia
Steven Camacho, Dora Maria Ballesteros & Diego Renza

Corresponding author

Correspondence to Steven Camacho.

Editor information

Editors and Affiliations

Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
Juan Carlos Figueroa-García
Universidad Santo Tomás de Aquino, Bogotá, Colombia
Yesid Díaz-Gutierrez
Universidad Distrital Francisco José de Caldas, Bogotá, Colombia
Elvis Eduardo Gaona-García
Universidad del Rosario, Bogotá, Colombia
Alvaro David Orjuela-Cañón

About this paper

Cite this paper

Camacho, S., Ballesteros, D.M., Renza, D. (2021). Fake Speech Recognition Using Deep Learning. In: Figueroa-García, J.C., Díaz-Gutierrez, Y., Gaona-García, E.E., Orjuela-Cañón, A.D. (eds) Applied Computer Sciences in Engineering. WEA 2021. Communications in Computer and Information Science, vol 1431. Springer, Cham. https://doi.org/10.1007/978-3-030-86702-7_4

DOI: https://doi.org/10.1007/978-3-030-86702-7_4
Published: 29 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86701-0
Online ISBN: 978-3-030-86702-7
eBook Packages: Computer Science, Computer Science (R0)
