Abstract
The proliferation of algorithms and commercial tools for the creation of synthetic audio has resulted in a significant increase in the amount of inaccurate information, particularly on social media platforms. As a direct result of this, efforts have been concentrated in recent years on identifying the presence of content of this kind. Despite this, there is still a long way to go until this problem is adequately addressed because of the growing naturalness of fake or synthetic audios. In this study, we proposed different networks configurations: a Custom Convolution Neural Network (cCNN) and two pretrained models (VGG16 and MobileNet) as well as end-to-end models to classify real and fake audios. An extensive experimental analysis was carried out on three classes of audio manipulation of the dataset FoR deepfake audio dataset. Also, we combined such sub-datasets to formulate a combined dataset FoR-combined to enhance the performance of the models. The experimental analysis shows that the proposed cCNN outperforms all the baseline models and other reference works with the highest accuracy of 97.23% on FoR-combined and sets new benchmarks for the datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Masood, M., Nawaz, M., Malik, K.M., Javed, A., Irtaza, A., Malik, H.: Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward. Appl. Intell. 53(4), 3974–4026 (2023)
Akhtar, Z.: Deepfakes generation and detection: a short survey. J. Imaging 9(1), 18 (2023)
Malik, K.M., Malik, H., Baumann, R.: Towards vulnerability analysis of voice-driven interfaces and countermeasures for replay attacks. In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 523–528. IEEE (2019)
Khanjani, Z., Watson, G., Janeja, V.P.: Audio deepfakes: a survey. Front. Big Data 5, 1001063 (2023). https://doi.org/10.3389/fdata.2022.1001063
Aljasem, M., et al.: Secure automatic speaker verification (SASV) system through SM-ALTP features and asymmetric bagging. IEEE Trans. Inf. Forensics Secur. 16, 3524–3537 (2021)
Firc, A., Malinka, K., Hanácek, P.: Deepfakes as a threat to a speaker and facial recognition: an overview of tools and attack vectors. Heliyon 9(4), e15090 (2023). https://doi.org/10.1016/j.heliyon.2023.e15090
Todisco, M., et al.: ASVspoof 2019: future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441 (2019)
Reimao, R., Tzerpos, V.: For: A dataset for synthetic speech detection. In: 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–10. IEEE (2019)
Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53, 5455–5516 (2020)
Wang, R., et al.: Deepsonar: towards effective and robust detection of ai-synthesized fake voices. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1207–1216 (2020)
Camacho, S., Ballesteros, D.M., Renza, D.: Fake speech recognition using deep learning. In: Figueroa-García, J.C., Díaz-Gutierrez, Y., Gaona-García, E.E., Orjuela-Cañón, A.D. (eds.) Applied Computer Sciences in Engineering: 8th Workshop on Engineering Applications, WEA 2021, Medellín, Colombia, October 6–8, 2021, Proceedings, pp. 38–48. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-86702-7_4
Khochare, J., Joshi, C., Yenarkar, B., Suratkar, S., Kazi, F.: A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng. 47(3), 3447–3458 (2021). https://doi.org/10.1007/s13369-021-06297-w
Iqbal, F., Abbasi, A., Javed, A.R., Jalil, Z., Al-Karaki, J.: Deepfake Audio Detection via Feature Engineering and Machine Learning (2022)
Hamza, A., et al.: Deepfake audio detection via MFCC features using machine learning. IEEE Access 10, 134018–134028 (2022)
Guha, S., Das, A., Singh, P.K., Ahmadian, A., Senu, N., Sarkar, R.: Hybrid feature selection method based on harmony search and naked mole-rat algorithms for spoken language identification from audio signals. IEEE Access 8, 182868–182887 (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Howard, A.G., et al.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Alabdulmohsin, I., Maennel, H., Keysers, D.: The impact of reinitialization on generalization in convolutional neural networks. arXiv preprint arXiv:2109.00267 2021
Acknowledgements
This study has been partially supported by SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union – NextGenerationEU and Sapienza University of Rome project 2022–2024 “EV2” (003 009 22).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wani, T.M., Amerini, I. (2023). Deepfakes Audio Detection Leveraging Audio Spectrogram and Convolutional Neural Networks. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_14
Download citation
DOI: https://doi.org/10.1007/978-3-031-43153-1_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1
eBook Packages: Computer ScienceComputer Science (R0)