Abstract
Voice and face recognition are becoming ubiquitous, and the need for secure biometric technologies grows as deepfakes make it increasingly difficult to spot fabricated content. To improve current audio spoofing detection, we propose a curated selection of wavelet-transform-based models in which, instead of the widely employed acoustic features, Mel-spectrogram image features are decomposed through multiresolution decomposition analysis to better capture spectral information. To this end, we adopt median-filtering harmonic-percussive source separation (HPSS) and conduct a large-scale study applying several recent state-of-the-art computer vision models to audio anti-spoofing. These wavelet transforms prove experimentally very useful, yielding a notable 4.8% EER on the ASVspoof 2019 challenge logical access (LA) evaluation set. Finally, we propose a more adversarially robust WaveletCNN-based model.
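As a rough illustration of the feature pipeline sketched in the abstract, the snippet below separates the harmonic and percussive components with median-filtering HPSS, computes a log-Mel-spectrogram for each stream, and decomposes the resulting image with a 2D discrete wavelet transform. This is a minimal sketch under assumed tooling (librosa and PyWavelets); the file path and all parameter values are illustrative, not the exact configuration used in the paper.

```python
# Minimal sketch of HPSS + Mel-spectrogram + multiresolution wavelet decomposition.
# Assumes librosa and PyWavelets; parameters are illustrative only.
import librosa
import numpy as np
import pywt

# Load an utterance (hypothetical path), resampled to 16 kHz.
y, sr = librosa.load("utterance.flac", sr=16000)

# Median-filtering HPSS: split the waveform into harmonic and percussive parts.
y_harm, y_perc = librosa.effects.hpss(y)

def log_mel(signal, sr, n_mels=128):
    """Log-scaled Mel-spectrogram treated as a 2D image."""
    S = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

def wavelet_decompose(image, wavelet="haar", level=2):
    """2D discrete wavelet (multiresolution) decomposition of the spectrogram image.

    Returns [approximation, (horizontal, vertical, diagonal) detail bands per level].
    """
    return pywt.wavedec2(image, wavelet=wavelet, level=level)

harm_coeffs = wavelet_decompose(log_mel(y_harm, sr))
perc_coeffs = wavelet_decompose(log_mel(y_perc, sr))

# The approximation and detail sub-bands would then be fed to a CNN-style classifier.
print(harm_coeffs[0].shape, [c.shape for c in harm_coeffs[1]])
```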
Notes
1. Models and full code of our experiments are available at: https://github.com/fathana/ARWaveletCNN.
Acknowledgments
The authors wish to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381 and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for their continued support.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Fathan, A., Alam, J., Kang, W. (2022). Multiresolution Decomposition Analysis via Wavelet Transforms for Audio Deepfake Detection. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science(), vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_17
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2