Abstract
Voice and face recognition are becoming ubiquitous, and the need for secure biometric technologies grows as deepfakes make it increasingly difficult to spot fabricated content. To improve current audio spoofing detection, we propose a curated selection of wavelet-transform-based models in which, instead of the widely employed acoustic features, Mel-spectrogram image features are decomposed through multiresolution decomposition analysis to better capture spectral information. To this end, we adopt median-filtering harmonic-percussive source separation (HPSS) and conduct a large-scale study applying several recent state-of-the-art computer vision models to audio anti-spoofing. These wavelet transforms prove experimentally very useful, yielding a notable 4.8% EER on the ASVspoof 2019 challenge logical access (LA) evaluation set. Finally, we propose a more adversarially robust WaveletCNN-based model.
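As a rough illustration of the feature pipeline sketched in the abstract, the snippet below separates the harmonic and percussive components with median-filtering HPSS, computes a log-Mel-spectrogram for each stream, and decomposes the resulting image with a 2D discrete wavelet transform. This is a minimal sketch under assumed tooling (librosa and PyWavelets); the file path and all parameter values are illustrative, not the exact configuration used in the paper.

```python
# Minimal sketch of HPSS + Mel-spectrogram + multiresolution wavelet decomposition.
# Assumes librosa and PyWavelets; parameters are illustrative only.
import librosa
import numpy as np
import pywt

# Load an utterance (hypothetical path), resampled to 16 kHz.
y, sr = librosa.load("utterance.flac", sr=16000)

# Median-filtering HPSS: split the waveform into harmonic and percussive parts.
y_harm, y_perc = librosa.effects.hpss(y)

def log_mel(signal, sr, n_mels=128):
    """Log-scaled Mel-spectrogram treated as a 2D image."""
    S = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def wavelet_decompose(image, wavelet="haar", level=2):
    """2D discrete wavelet (multiresolution) decomposition of the spectrogram image.

    Returns [approximation, (horizontal, vertical, diagonal) detail bands per level].
    """
    return pywt.wavedec2(image, wavelet=wavelet, level=level)

harm_coeffs = wavelet_decompose(log_mel(y_harm, sr))
perc_coeffs = wavelet_decompose(log_mel(y_perc, sr))

# The approximation and detail sub-bands would then be fed to a CNN-style classifier.
print(harm_coeffs[0].shape, [c.shape for c in harm_coeffs[1]])
```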
Notes
1. Models and full code of our experiments are available at: https://github.com/fathana/ARWaveletCNN.
Acknowledgments
The authors wish to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381 and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for their continued support.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Fathan, A., Alam, J., Kang, W. (2022). Multiresolution Decomposition Analysis via Wavelet Transforms for Audio Deepfake Detection. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science(), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_17
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2