Abstract
Recent advances in countermeasure systems for spoofed-audio detection have helped build more robust Automatic Speaker Verification (ASV) systems. However, the available countermeasure systems do not generalize well to unknown attacks. The dominant reason for this poor performance is the lack of context-dependent information extracted from the given speech at a fine-grained level. To build a noise-robust anti-spoofing system, this paper proposes a Time Delay Neural Network (TDNN)-based countermeasure system that captures context-dependent information well. We devise a three-stage design. In the first stage, the audio is pre-processed to extract useful information using three types of features: Mel Frequency Cepstral Coefficients (MFCC), noise-robust Gammatone Cepstral Coefficients (GTCC), and integrated MFCC-GTCC features. In the second stage, these features are input to the proposed Deep Neural Network (DNN) model, which uses a Long Short-Term Memory (LSTM) network for recurrent aggregation of the shallow features generated layer-wise in the TDNN; the output is then passed through a context-dependent pooling layer to generate a fixed-length representation. In the third stage, this representation is used to classify the speech as genuine or spoofed. The proposed system is tested on the Logical Access (LA) track of the ASVspoof 2019 dataset and achieves performance improvements of about 59.7% and 65.9% relative to the earlier Linear-Frequency Cepstral Coefficients-Gaussian Mixture Model (LFCC-GMM) and Constant Q Cepstral Coefficients-Gaussian Mixture Model (CQCC-GMM) baseline models, respectively.
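The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the feature dimensions, the context offsets, the random weights, and the function names (`integrate_features`, `tdnn_layer`, `stats_pool`) are all assumptions made for illustration. Random matrices stand in for real MFCC and GTCC features, the LSTM-based layer aggregation is omitted, and plain mean/standard-deviation pooling stands in for the context-dependent pooling layer, whose details the abstract does not give.

```python
import numpy as np


def integrate_features(mfcc, gtcc):
    """Frame-wise concatenation of two (frames, coeffs) feature matrices."""
    if mfcc.shape[0] != gtcc.shape[0]:
        raise ValueError("both feature streams need the same number of frames")
    return np.concatenate([mfcc, gtcc], axis=1)


def tdnn_layer(x, weight, bias, context=(-2, -1, 0, 1, 2)):
    """One time-delay layer: splice frames at the context offsets, then affine + ReLU."""
    left = max(0, -min(context))
    right = max(0, max(context))
    # Keep only time steps whose whole context window lies inside the utterance.
    centers = np.arange(left, x.shape[0] - right)
    spliced = np.concatenate([x[centers + c] for c in context], axis=1)
    return np.maximum(spliced @ weight + bias, 0.0)


def stats_pool(h):
    """Collapse a variable number of frames into one fixed-length vector (mean and std)."""
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])


# Hypothetical sizes: 200 frames, 13 coefficients per feature type, 64 hidden units.
rng = np.random.default_rng(0)
context = (-2, -1, 0, 1, 2)
mfcc = rng.standard_normal((200, 13))   # placeholder for real MFCC features
gtcc = rng.standard_normal((200, 13))   # placeholder for real GTCC features

feats = integrate_features(mfcc, gtcc)
weight = rng.standard_normal((feats.shape[1] * len(context), 64)) * 0.1
bias = np.zeros(64)

hidden = tdnn_layer(feats, weight, bias, context)
embedding = stats_pool(hidden)
print(feats.shape, hidden.shape, embedding.shape)
```

The point of the pooling step is that utterances of different lengths all map to an embedding of the same size, which a downstream classifier can then label as genuine or spoofed.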
Data availability
All the data generated or analyzed during this study are included and referred to in this published article.
Funding
I, Dr. Mohit Dua, on behalf of all the authors, declare that this study did not receive funding from any source.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest regarding the submitted manuscript.
Ethical Approval
This research article does not contain any studies with human participants or animals performed by any of the authors.
Human and Animal Rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Enabling Innovative Computational Intelligence Technologies for IOT” guest edited by Omer Rana, Rajiv Misra, Alexander Pfeiffer, Luigi Troiano and Nishtha Kesswani.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chakravarty, N., Dua, M. Noise Robust ASV Spoof Detection Using Integrated Features and Time Delay Neural Network. SN COMPUT. SCI. 4, 127 (2023). https://doi.org/10.1007/s42979-022-01557-4
DOI: https://doi.org/10.1007/s42979-022-01557-4