
Multiresolution Decomposition Analysis via Wavelet Transforms for Audio Deepfake Detection

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2022)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13721))


Abstract

Voice and face recognition are becoming omnipresent, and the need for secure biometric technologies grows as deepfakes make it increasingly hard to spot artificially generated content. To improve current audio spoofing detection, we propose a curated selection of wavelet-transform-based models in which, instead of the widely employed acoustic features, Mel-spectrogram image features are decomposed through multiresolution decomposition analysis to better handle spectral information. To that end, we adopt median-filtering harmonic-percussive source separation (HPSS) and perform a large-scale study of the application of several recent state-of-the-art computer vision models to audio anti-spoofing. These wavelet transforms are experimentally found to be very useful and lead to a notable performance of 4.8% EER on the ASVspoof 2019 challenge logical access (LA) evaluation set. Finally, a more adversarially robust WaveletCNN-based model is proposed.
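
A minimal sketch of the front-end described above, assuming librosa for the Mel-spectrogram and median-filtering HPSS and PyWavelets for the 2D multiresolution decomposition; the function name wavelet_features and all parameter values (sampling rate, number of Mel bands, wavelet family, decomposition level) are illustrative assumptions rather than the authors' exact configuration:

```python
# Sketch: waveform -> median-filtering HPSS -> log-Mel spectrogram "image"
# -> 2D discrete wavelet decomposition (multiresolution analysis).
import librosa
import numpy as np
import pywt


def wavelet_features(wav_path, sr=16000, n_mels=80, wavelet="haar", level=2):
    y, sr = librosa.load(wav_path, sr=sr)

    # Median-filtering harmonic/percussive source separation (Fitzgerald-style HPSS).
    y_harm, y_perc = librosa.effects.hpss(y)

    feats = []
    for component in (y_harm, y_perc):
        # Log-Mel spectrogram treated as a 2D image feature.
        mel = librosa.feature.melspectrogram(y=component, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)

        # Multiresolution decomposition: an approximation sub-band plus
        # (horizontal, vertical, diagonal) detail sub-bands per level.
        coeffs = pywt.wavedec2(log_mel, wavelet=wavelet, level=level)
        feats.append(coeffs)
    return feats
```

The resulting sub-band coefficients can then be fed to an image classifier (e.g., a WaveletCNN-style backbone) in place of raw Mel-spectrogram inputs.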


Notes

  1. Models and full code of our experiments are available at: https://github.com/fathana/ARWaveletCNN.
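
For readers unfamiliar with the metric quoted in the abstract, the sketch below shows one common way to compute an equal error rate (EER) from detection scores. It relies on scikit-learn's roc_curve as an assumed tool, not necessarily what the released evaluation code uses, and compute_eer is a hypothetical helper name:

```python
# Minimal EER sketch: labels are 1 for bona fide and 0 for spoof;
# higher scores mean "more bona-fide-like".
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is the operating point where the false-acceptance rate (fpr)
    # and the false-rejection rate (fnr) are approximately equal.
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0
```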


Acknowledgments

The authors wish to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381, and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for its continued support.

Author information


Corresponding author

Correspondence to Jahangir Alam.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Fathan, A., Alam, J., Kang, W. (2022). Multiresolution Decomposition Analysis via Wavelet Transforms for Audio Deepfake Detection. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science (LNAI), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_17


  • DOI: https://doi.org/10.1007/978-3-031-20980-2_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science, Computer Science (R0)
