
Source and System-Based Modulation Approach for Fake Speech Detection

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series: SPECOM (International Conference on Speech and Computer)

Abstract

The advancement of deep learning technology in speech generation has made fake speech almost perceptually indistinguishable from real speech. Most approaches in the literature are dataset-dependent and fail to detect fake speech under domain variability or in cross-dataset scenarios. This study explores the potential of excitation source features to detect fake speech under domain-variable conditions. Motivated by the discrimination observed using excitation source information, this work proposes a new feature, the residual modulation spectrogram, for fake speech detection. ResNet-34 is used for the binary classification task of distinguishing fake from real speech, with the modulation spectrogram serving as the baseline feature. The proposed approach performs well under domain variability in most cases and generalizes across different datasets. Additionally, the score-level combination of the residual modulation spectrogram and the modulation spectrogram further improves performance, confirming the efficacy of combining source and system information under domain variability.
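To make the pipeline described in the abstract concrete, the sketch below illustrates one way to derive the two feature streams and fuse their scores. It is a minimal, illustrative sketch only, assuming librosa, SciPy, and PyTorch/torchvision; the LP order, mel-filterbank front end, modulation-analysis parameters, ResNet-34 training, and equal-weight score averaging are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the source/system feature streams and score-level fusion.
# Assumptions (not taken from the paper): 10th-order LP analysis, a mel
# filterbank for the subband envelopes, FFT along time for the modulation
# axis, and simple averaging of posterior scores.
import numpy as np
import librosa
import scipy.signal
import torch
import torchvision

def lp_residual(y, order=10):
    """Excitation source signal: inverse-filter speech with its LP coefficients."""
    a = librosa.lpc(y, order=order)            # a[0] == 1.0
    return scipy.signal.lfilter(a, [1.0], y)   # prediction error (LP residual)

def modulation_spectrogram(x, sr, n_mels=64, n_fft=512, hop=160, n_mod=64):
    """Subband envelopes (mel spectrogram) transformed along time."""
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    env = np.log1p(mel)                                   # n_mels x frames
    mod = np.abs(np.fft.rfft(env, axis=1))[:, :n_mod]     # keep low mod. freqs
    return mod.astype(np.float32)

def make_input(feat):
    """Tile a single-channel feature map to 3 channels for ResNet-34."""
    t = torch.from_numpy(feat)[None, None]     # 1 x 1 x H x W
    return t.repeat(1, 3, 1, 1)

# One ResNet-34 classifier (real vs. fake) per feature stream.
sys_net = torchvision.models.resnet34(num_classes=2)  # modulation spectrogram
src_net = torchvision.models.resnet34(num_classes=2)  # residual modulation spectrogram
sys_net.eval(); src_net.eval()

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical input file
mod_feat = modulation_spectrogram(y, sr)               # system information
res_feat = modulation_spectrogram(lp_residual(y), sr)  # source information

with torch.no_grad():
    p_sys = torch.softmax(sys_net(make_input(mod_feat)), dim=1)
    p_src = torch.softmax(src_net(make_input(res_feat)), dim=1)
    p_fused = 0.5 * (p_sys + p_src)            # score-level combination
```

The intuition behind the residual stream is that inverse LP filtering largely removes the vocal-tract (system) contribution, so its modulation spectrogram emphasizes excitation-source behavior that synthesizers may not reproduce faithfully; the fusion weight shown here is a simple equal average chosen for illustration.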



Acknowledgement

The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Government of India, for funding this research work, and “Anantganak”, the high-performance computing (HPC) facility at IIT Dharwad, for enabling us to perform our experiments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rishith Sadashiv T. N.

Editor information

Editors and Affiliations

Ethics declarations

All views and data related to information technology, and anything deemed to be “cyber security”, are expressed on behalf of the authors of this paper and not on behalf of McAfee.

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sadashiv T. N., R., Kumar, D., Agarwal, A., Tzudir, M., Mishra, J., Prasanna, S.R.M. (2023). Source and System-Based Modulation Approach for Fake Speech Detection. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science (LNAI), vol. 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_12


  • DOI: https://doi.org/10.1007/978-3-031-48309-7_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
