
Source and System-Based Modulation Approach for Fake Speech Detection

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series: SPECOM (International Conference on Speech and Computer)

Abstract

The advancement of deep learning technology in speech generation has made fake speech almost perceptually indistinguishable from real speech. Most approaches in the literature are dataset-dependent and fail to detect fake speech under domain variability or in cross-dataset scenarios. This study explores the potential of excitation source features to detect fake speech under domain-variable conditions. Motivated by the discrimination observed using excitation source information, this work proposes a new feature, the residual modulation spectrogram, for fake speech detection. ResNet-34 is used for the binary classification task of distinguishing fake from real speech, with the modulation spectrogram serving as the baseline feature. The proposed approach performs well under domain variability in most cases and generalizes across different datasets. Additionally, the score-level combination of the residual modulation spectrogram and the modulation spectrogram further improves performance, confirming the efficacy of combining source and system information under domain variability.
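To make the pipeline described in the abstract concrete, the sketch below illustrates one way to derive the two feature streams and fuse their scores. It is a minimal, illustrative sketch only, assuming librosa, SciPy, and PyTorch/torchvision; the LP order, mel-filterbank front end, modulation-analysis parameters, ResNet-34 training, and equal-weight score averaging are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the source/system feature streams and score-level fusion.
# Assumptions (not taken from the paper): 10th-order LP analysis, a mel
# filterbank for the subband envelopes, FFT along time for the modulation
# axis, and simple averaging of posterior scores.
import numpy as np
import librosa
import scipy.signal
import torch
import torchvision

def lp_residual(y, order=10):
    """Excitation source signal: inverse-filter speech with its LP coefficients."""
    a = librosa.lpc(y, order=order)            # a[0] == 1.0
    return scipy.signal.lfilter(a, [1.0], y)   # prediction error (LP residual)

def modulation_spectrogram(x, sr, n_mels=64, n_fft=512, hop=160, n_mod=64):
    """Subband envelopes (mel spectrogram) transformed along time."""
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    env = np.log1p(mel)                                   # n_mels x frames
    mod = np.abs(np.fft.rfft(env, axis=1))[:, :n_mod]     # keep low mod. freqs
    return mod.astype(np.float32)

def make_input(feat):
    """Tile a single-channel feature map to 3 channels for ResNet-34."""
    t = torch.from_numpy(feat)[None, None]     # 1 x 1 x H x W
    return t.repeat(1, 3, 1, 1)

# One ResNet-34 classifier (real vs. fake) per feature stream.
sys_net = torchvision.models.resnet34(num_classes=2)  # modulation spectrogram
src_net = torchvision.models.resnet34(num_classes=2)  # residual modulation spectrogram
sys_net.eval(); src_net.eval()

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical input file
mod_feat = modulation_spectrogram(y, sr)               # system information
res_feat = modulation_spectrogram(lp_residual(y), sr)  # source information

with torch.no_grad():
    p_sys = torch.softmax(sys_net(make_input(mod_feat)), dim=1)
    p_src = torch.softmax(src_net(make_input(res_feat)), dim=1)
    p_fused = 0.5 * (p_sys + p_src)            # score-level combination
```

The intuition behind the residual stream is that inverse LP filtering largely removes the vocal-tract (system) contribution, so its modulation spectrogram emphasizes excitation-source behavior that synthesizers may not reproduce faithfully; the fusion weight shown here is a simple equal average chosen for illustration.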



Acknowledgement

The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Government of India, for funding this research work, and “Anantganak”, the high-performance computing (HPC) facility at IIT Dharwad, for enabling us to perform our experiments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rishith Sadashiv T. N.

Editor information

Editors and Affiliations

Ethics declarations

All views and data related to information technology, and anything deemed to be “cyber security”, are expressed on behalf of the authors of this paper and not on behalf of McAfee.

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sadashiv T. N., R., Kumar, D., Agarwal, A., Tzudir, M., Mishra, J., Prasanna, S.R.M. (2023). Source and System-Based Modulation Approach for Fake Speech Detection. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science (LNAI), vol. 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_12


  • DOI: https://doi.org/10.1007/978-3-031-48309-7_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48308-0

  • Online ISBN: 978-3-031-48309-7

  • eBook Packages: Computer Science, Computer Science (R0)
