Abstract
Speech technologies such as automatic speaker verification (ASV) systems can now verify a speaker's identity with high accuracy, and are therefore widely deployed in biometrics and banking. With advances in deep learning, deepfake speech has become a primary threat to these ASV systems, as researchers continue to propose methods that generate speech with characteristics nearly indistinguishable from genuine speech. Various techniques exist for fake speech detection, but they tend to be tailored to a specific dataset or to a particular source of fake speech generation. In this work, we propose a modulation spectrogram-based approach to fake speech detection and show that the modulation spectrogram can discriminate fake from genuine speech under speaker, session, gender, domain, and generation-source variation. The proposed approach is evaluated on the CMU-Arctic, LJ Speech, and LibriTTS datasets, and classification accuracy is reported. The accuracy scores show that the proposed approach can classify fake speech.
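The core feature named in the abstract is the modulation spectrogram: the spectrum of the slow temporal envelope of each acoustic frequency band. As a rough illustration of the idea (not the paper's exact pipeline; the frame size, hop, and log compression below are assumed choices), one can compute an STFT magnitude, treat each frequency bin's trajectory over time as an envelope, and take a second FFT along the time axis:

```python
import numpy as np

def modulation_spectrogram(x, sr, n_fft=512, hop=160):
    """Return an (acoustic-frequency x modulation-frequency) magnitude array.

    Illustrative sketch: STFT magnitude, log compression, per-band mean
    removal, then an FFT of each band's envelope along the time axis.
    """
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # STFT magnitude: (frames, freq_bins)
    spec = np.array([np.abs(np.fft.rfft(win * x[i * hop:i * hop + n_fft]))
                     for i in range(n_frames)])
    env = np.log1p(spec)                    # compress dynamic range
    env -= env.mean(axis=0, keepdims=True)  # remove per-band DC offset
    mod = np.abs(np.fft.rfft(env, axis=0))  # FFT along time -> modulation axis
    return mod.T                            # (freq_bins, mod_bins)

# Demo: 1 s of a 440 Hz tone with 4 Hz amplitude modulation at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
M = modulation_spectrogram(x, sr)
print(M.shape)  # (257, 49): 257 acoustic bins, 49 modulation bins
```

With a 100 Hz frame rate (hop = 160 at 16 kHz), the 4 Hz amplitude modulation appears as a peak near modulation bin 4 in the acoustic band containing 440 Hz, which is the kind of envelope structure the modulation spectrogram exposes.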
Acknowledgments
This work is funded by the Ministry of Electronics and Information Technology (MeitY), Government of India, under the project titled "Fake Speech Detection Using Deep Learning Framework".
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Magazine, R., Agarwal, A., Hedge, A., Prasanna, S.R.M. (2022). Fake Speech Detection Using Modulation Spectrogram. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2